Introduction

Dear Jupyter Notebook reader,

to run the Notebook please install the requirements first, e.g. via pip install -r requirements.txt. All code has been tested on a MacOS machine with 32 GB of RAM and a Python 3.7 environment. Running this notebook may take up to 60 minutes depending on the hardware configuration of your machine. Therefore, a HTML version of this notebook containing all outputs is provided as well.

Imports

In [1]:
# Standard libraries
import time
from dateutil.relativedelta import relativedelta
import os
from itertools import starmap, product
from IPython.core.interactiveshell import InteractiveShell

# Third-party utility libraries
import pandas as pd
import numpy as np
import holidays
import geopy.distance
from scipy import stats

# Machine learning libraries
from fbprophet import Prophet
from fbprophet.diagnostics import cross_validation
from fbprophet.diagnostics import performance_metrics
from statsmodels.graphics.tsaplots import plot_pacf, plot_acf
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Visualisation libraries
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.offline as py
import folium
import plotly.express as px

# Styling 
py.init_notebook_mode()
plt.style.use('tableau-colorblind10')
InteractiveShell.ast_node_interactivity = 'all'
pd.plotting.register_matplotlib_converters()

Data Preparation

Data Cleansing

In [2]:
# Importing raw data
data = pd.read_csv('datafiles/raw/OPENDATA_BOOKING_CALL_A_BIKE.csv',sep=';', dtype={'DISTANCE': 'float64'})

# Filtering the data for relevant location
data = data[data.CITY_RENTAL_ZONE == 'Kassel']
In [3]:
# Transforming column DATE_BOOKING to datetime column
data['DATE_BOOKING'] = pd.to_datetime(data.DATE_BOOKING)

# Setting the index of Data to datetimeindex
data.set_index(data['DATE_BOOKING'], inplace=True)

# Filtering the Dataset to relevant years 
data = data.loc['2015-01-01 00:00:00':'2016-12-31 23:59:59']
In [4]:
# Printing the value counts for columns that seem to have no varying information
for column in ['CATEGORY_HAL_ID', 'TRAVERSE_USE', 'DISTANCE', 'CITY_RENTAL_ZONE', 'TECHNICAL_INCOME_CHANNEL',
               'RENTAL_ZONE_HAL_SRC', 'COMPUTE_EXTRA_BOOKING_FEE']:
    print(data[column].value_counts(), '\n')
    
# Checking if DATE_BOOKING is the same as DATE_FROM
print("DATE_BOOKING = DATE_FROM \n",(data.DATE_BOOKING == pd.to_datetime(data.DATE_FROM)).value_counts())
53000    351480
Name: CATEGORY_HAL_ID, dtype: int64 

Nein    351480
Name: TRAVERSE_USE, dtype: int64 

0.0    351480
Name: DISTANCE, dtype: int64 

Kassel    351480
Name: CITY_RENTAL_ZONE, dtype: int64 

IVR                         139115
Android KON                 116500
iPhone KON                   83114
Android CAB                   2218
iPhone CAB                    1311
Windows                        458
Android SRH                    142
Terminal KS_1 (- 14 -)         127
iPhone SRH                     105
Terminal KS_22 (- 212 -)        92
Terminal KS_1 (- 11 -)          35
Android                          5
Name: TECHNICAL_INCOME_CHANNEL, dtype: int64 

Standort    351480
Name: RENTAL_ZONE_HAL_SRC, dtype: int64 

Nein    351480
Name: COMPUTE_EXTRA_BOOKING_FEE, dtype: int64 

DATE_BOOKING = DATE_FROM 
 True    351480
dtype: int64

The above shows that the columns CATEGORY_HAL_ID, TRAVERSE_USE, DISTANCE, CITY_RENTAL_ZONE, RENTAL_ZONE_HAL_SRC, COMPUTE_EXTRA_BOOKING_FEE and DATE_BOOKING do not include any information as they contain the same for any observation. These columns can therefore be eliminated.

The column TECHNICAL_INCOME_CHANNEL does contain different values for observations. These values however cannot be interpreted as there is no explanation provided by DB. Therfore this column can also be dropped.

Column Decision Reason
BOOKIN_HAL_ID KEPT will be used to calculate the booking counts as it represents a unique booking
VEHICLE_HAL_ID KEPT will be used to calculate the number of vehicles that are used to satisfy the demand
CUSTOMER_HAL_ID KEPT kept for clustering purposes unique identifier of the customer
DATE_FROM KEPT equal to date_booking but since date booking is used as index this column is needed later
DATE_UNTIL KEPT kept for duration calculation and possible graphics
START_RENTAL_ZONE KEPT kept for clustering
START_RENTAL_ZONE_HAL_ID KEPT kept for later merge
END_RENTAL_ZONE KEPT kept for clustering
END_RENTAL_ZONE_HAL_ID KEPT kept for later merge
RENTAL_ZONE_HAL_SRC REMOVED always standort so no information
TECHNICAL_INCOME_CHANNEL REMOVED no explanation in the documentation and therefore not interpretable
COMPUTE_EXTRA_BOOKING_FEE REMOVED always no so dropped
DATE_BOOKING REMOVED duplicate of DateFrom
In [5]:
# Dropping unnecessary columns
data.drop(columns={'CATEGORY_HAL_ID', 'TRAVERSE_USE', 'DISTANCE', 'CITY_RENTAL_ZONE', 'DATE_BOOKING',
                   'TECHNICAL_INCOME_CHANNEL', 'RENTAL_ZONE_HAL_SRC', 'COMPUTE_EXTRA_BOOKING_FEE'}, 
          axis=1,
          inplace=True)
In [6]:
# Printing a summary of data
print(data.info())

# Calculating the loss of eliminating observations with missing data
percentage_of_missing_values = round(((len(data) - len(data.dropna(axis=0))) / len(data)) * 100, 4)
print('\nLoss associated eliminating observations with missing values: {}%'.format(percentage_of_missing_values))
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 351480 entries, 2015-01-01 01:29:04 to 2016-12-31 17:39:14
Data columns (total 9 columns):
BOOKING_HAL_ID              351480 non-null int64
VEHICLE_HAL_ID              351480 non-null int64
CUSTOMER_HAL_ID             351480 non-null object
DATE_FROM                   351480 non-null object
DATE_UNTIL                  351480 non-null object
START_RENTAL_ZONE           351465 non-null object
START_RENTAL_ZONE_HAL_ID    351480 non-null float64
END_RENTAL_ZONE             351479 non-null object
END_RENTAL_ZONE_HAL_ID      351479 non-null float64
dtypes: float64(2), int64(2), object(5)
memory usage: 26.8+ MB
None

Loss associated eliminating observations with missing values: 0.0046%

The columns START_RENTAL_ZONE, END_RENTAL_ZONE and END_RENTAL_ZONE_HAL_ID contain missing Values. Dropping the corresponding observations will only lead to a loss of 0.0046% which is acceptable.

In [7]:
# Dropping rows containing any missing value
data.dropna(axis=0, inplace=True)

Data Preparation for Clustering

In [8]:
# Importing data containing the geographical information
zone_data = pd.read_csv('datafiles/raw/OPENDATA_RENTAL_ZONE_CALL_A_BIKE.csv', sep=';')

# Filtering the dataset for relevant city
zone_data = zone_data[zone_data['CITY'] == 'Kassel']

# Dropping unnecessary columns
zone_data = zone_data[['RENTAL_ZONE_HAL_ID', 'LATITUDE', 'LONGITUDE']]

zone_data.head()
Out[8]:
RENTAL_ZONE_HAL_ID LATITUDE LONGITUDE
472 204778 9,446111000000000 51,280278000000000
473 204779 9,472777777777777 51,289722222222220
474 204780 9,468852000000000 51,284697000000000
475 204781 9,481111111111112 51,298611111111110
476 204782 9,499444444444444 51,300277777777780

The columns LATITUDE and LONGITUDE are in the wrong formatting in order to use them for Distance calculation the ',' have to be replaced with '.'. Also both columns are interchanged, which has been tested with Google maps.

In [9]:
# Swapping LATITUDE and LONGTIUDE and replacing ',' with '.' 
zone_data['LATITUDE'] = zone_data['LATITUDE'].apply(lambda x: float(str(x).replace(',', '.')))
zone_data['LONGITUDE'] = zone_data['LONGITUDE'].apply(lambda x: float(str(x).replace(',', '.')))
zone_data.rename(columns={'LONGITUDE': 'LATITUDE', 'LATITUDE': 'LONGITUDE'}, inplace=True)
In [10]:
zone_data.head()
Out[10]:
RENTAL_ZONE_HAL_ID LONGITUDE LATITUDE
472 204778 9.446111 51.280278
473 204779 9.472778 51.289722
474 204780 9.468852 51.284697
475 204781 9.481111 51.298611
476 204782 9.499444 51.300278

Creating Start Longitude and Latitude

In [11]:
# Mergin LATITUDE and LONGTIDUE for clustering analysis
data = data.merge(zone_data, how='left', left_on='START_RENTAL_ZONE_HAL_ID', right_on='RENTAL_ZONE_HAL_ID')

# Dropping RENTAL_ZONE_HAL_ID resulting from merging rental_zone since not needed
data.drop(columns=['RENTAL_ZONE_HAL_ID'], axis=1, inplace=True)

# Renaming columns LATITUDE and LONGITUDE to START_LATITUDE and START_LONGITUDE for cluster analysis
data.rename(columns={'LATITUDE': 'START_LATITUDE', 'LONGITUDE': 'START_LONGITUDE'}, inplace=True)

Creating End Longitude and Latitude

In [12]:
# Merging LATITUDE and LONGTIDUE for cluster analysis
data = data.merge(zone_data, how='left', left_on='END_RENTAL_ZONE_HAL_ID', right_on='RENTAL_ZONE_HAL_ID')

# Dropping RENTAL_ZONE_HAL_ID resulting from merging rental_zone since not needed
data.drop(columns=['RENTAL_ZONE_HAL_ID'], axis=1, inplace=True)

# Renaming columns LATITUDE and LONGITUDE to END_LATITUDE and END_LONGITUDE for cluster analysis
data.rename(columns={'LATITUDE': 'END_LATITUDE', 'LONGITUDE': 'END_LONGITUDE'}, inplace=True)

Calculating the distance

In [13]:
data['START_RENTAL_ZONE_HAL_ID'] = data['START_RENTAL_ZONE_HAL_ID'].apply(lambda x: int(x))
data['END_RENTAL_ZONE_HAL_ID'] = data['END_RENTAL_ZONE_HAL_ID'].apply(lambda x: int(x))
In [14]:
# Extracting the individual RENZAL_ZONE_IDs
zone_ids = zone_data['RENTAL_ZONE_HAL_ID']

# Creating a list containing every possible START/END zone combination
combinations = list(product(zone_ids, zone_ids))

# Creating a DataFrame for distance calculations having individual combinations as index
comb_frame = pd.DataFrame(index=combinations)

# For loop calculating the distance of each combination. If coordinates are not presen Distance is set to -1
for i in combinations:
    start_latitude = zone_data[zone_data['RENTAL_ZONE_HAL_ID'] == i[0]].LATITUDE.values
    start_longitude = zone_data[zone_data['RENTAL_ZONE_HAL_ID'] == i[0]].LONGITUDE.values
    end_latitude = zone_data[zone_data['RENTAL_ZONE_HAL_ID'] == i[1]].LATITUDE.values
    end_logitude = zone_data[zone_data['RENTAL_ZONE_HAL_ID'] == i[1]].LONGITUDE.values
    try:
        comb_frame.loc[i, 'DISTANCE'] = geopy.distance.distance((start_latitude, start_longitude),
                                                                (end_latitude, end_logitude)).km
    except ValueError:
        comb_frame.loc[i, 'DISTANCE'] = -1

# Resetting the index for the merge
comb_frame.reset_index(inplace=True)

# Converting the index to string for the merge
comb_frame['index'] = comb_frame['index'].apply(lambda x: str(x))

# Converting the ID columns to int
data['START_RENTAL_ZONE_HAL_ID'] = data['START_RENTAL_ZONE_HAL_ID'].apply(lambda x: int(x))
data['END_RENTAL_ZONE_HAL_ID'] = data['END_RENTAL_ZONE_HAL_ID'].apply(lambda x: int(x))

# Create column that can be matched with the index column of comb_frame
data['combination'] = '(' + data['START_RENTAL_ZONE_HAL_ID'].astype(str) + ', ' + data['END_RENTAL_ZONE_HAL_ID'].astype(
    str) + ')'

# Merging data and comb_frame
data = data.merge(comb_frame, how='left', left_on='combination', right_on='index')

# Dropping columns that were necessary for merging
data.drop(columns=['index', 'combination'], axis=1, inplace=True)

# Deleting data that is not used anymore to free RAM
del [comb_frame, zone_data, combinations, zone_ids]

Calculation of BOOKING_DURATION

In [15]:
# Calculating booking duration by end - start
data['BOOKING_DURATION'] = pd.to_datetime(data.DATE_UNTIL) - pd.to_datetime(data.DATE_FROM)

# Transforming booking duration to int
data['BOOKING_DURATION'] = data.BOOKING_DURATION.apply(lambda x: x.total_seconds() / 60)
In [16]:
# Creating final cluster data variable and remove unrealistic bookings
cluster_data = data.copy()
cluster_data = cluster_data.loc[cluster_data['BOOKING_DURATION'] < 10000, :]

Data Preparation for Predictive Analysis

For prediction, the only useful feature of the original dataset is the number of bikes rented in a specific time frame. Even though prediction could be done on the own trajectory only, adding additional information from other sources can enhance predictive accuracy. To be able to merge hourly climate data and to ensure evenly spaced time steps the data will be re-sampled to a one hour resolution, which is achieved by counting the number of unique rentals in a specific hour. After this the climate data and information on holidays will be merged. Since the dataset then contains all available information the dataset will then be re-sampled to a two, six and twenty four hour resolution choosing appropriate aggregation for each feature.

Resampling to One Hour Resolution

In [17]:
# Setting a datetimeindex for resampling
data.set_index(pd.to_datetime(data['DATE_FROM']), inplace=True)

# Dropping columns that are not needed during predictive analysis
data.drop(columns=['END_RENTAL_ZONE', 'END_RENTAL_ZONE_HAL_ID', 'START_RENTAL_ZONE', 'START_RENTAL_ZONE_HAL_ID',
                   'CUSTOMER_HAL_ID', 'DATE_UNTIL', 'START_LATITUDE', 'START_LONGITUDE', 'END_LATITUDE', 'DATE_FROM',
                   'END_LONGITUDE','VEHICLE_HAL_ID','BOOKING_DURATION','DISTANCE'], axis=1, inplace=True)

# Resampling to one hour resolution
one_h_data = data.resample('H').agg({'BOOKING_HAL_ID': np.count_nonzero})

Climate Data

Since the meteorological station of Kassel has moved in 2013 (for reference consult Link), the climate data used in this analysis is taken from the station in Schauenburg Elgershausen which is only eleven kilometers away from Kassel. The data was obtained from the climate data center of the DWD (Link) using the associated station number is 15207.

The following climate features have been used:

  • hourly sunshine duration
  • air temperature and humidity at two meters above ground
  • hourly precipitation height
  • binary feature indicating whether rain has fallen in the hour
  • mean wind speed
In [18]:
# Importing and preparing data containig information of hourly sunshine duration
sd_path = 'datafiles/weather/stundenwerte_SunshineDuration_15207_20131101_20181231_hist/produkt_sd_stunde_20131101_20181231_15207.txt'
sd_data = pd.read_csv(sd_path, sep=';')
sd_data['MESS_DATUM'] = pd.to_datetime(sd_data.MESS_DATUM, format='%Y%m%d%H')
sd_data.set_index('MESS_DATUM', inplace=True, drop=True)
sd_data.rename(columns={'SD_SO': 'SUNSHINE_DURATION'}, inplace=True)
sd_data.drop(['QN_7', 'eor', 'STATIONS_ID'], axis=1, inplace=True)

# Importing and preparing data containig information of hourly temparature and humidity
temp_path = 'datafiles/weather/stundenwerte_TemperatureHumidity_15207_20131101_20181231_hist/produkt_tu_stunde_20131101_20181231_15207.txt'
temp_data = pd.read_csv(temp_path, sep=';')
temp_data['MESS_DATUM'] = pd.to_datetime(temp_data.MESS_DATUM, format='%Y%m%d%H')
temp_data.set_index('MESS_DATUM', inplace=True, drop=True)
temp_data.rename(columns={'TT_TU': 'TWO_METER_AIR_TEMP', 'RF_TU': 'TWO_METER_HUMIDITY'}, inplace=True)
temp_data.drop(['QN_9', 'eor', 'STATIONS_ID'], axis=1, inplace=True)

# Importing and preparing data containig information of hourly precipitation and if rain has fallen
rain_path = 'datafiles/weather/stundenwerte_Precipitation_15207_20131101_20181231_hist/produkt_rr_stunde_20131101_20181231_15207.txt'
rain_data = pd.read_csv(rain_path, sep=';')
rain_data['MESS_DATUM'] = pd.to_datetime(rain_data.MESS_DATUM, format='%Y%m%d%H')
rain_data.set_index('MESS_DATUM', inplace=True, drop=True)
rain_data.rename(columns={'  R1': 'HOURLY_PRECIPITATION_HEIGHT', 'RS_IND': 'RAIN_FALLEN'}, inplace=True)
rain_data.drop(['QN_8', 'eor', 'STATIONS_ID', 'WRTR'], axis=1, inplace=True)

# Importing and preparing data containig information of hourly mean wind speed
wind_path = 'datafiles/weather/stundenwerte_Wind_15207_20131101_20181231_hist/produkt_ff_stunde_20131101_20181231_15207.txt'
wind_data = pd.read_csv(wind_path, sep=';')
wind_data['MESS_DATUM'] = pd.to_datetime(wind_data.MESS_DATUM, format='%Y%m%d%H')
wind_data.set_index('MESS_DATUM', inplace=True, drop=True)
wind_data.rename(columns={'   F': 'MEAN_WIND_SPEED', '   D': 'MEAN_WIND_DIRECTION'}, inplace=True)
wind_data.drop(['QN_3', 'eor', 'STATIONS_ID', 'MEAN_WIND_DIRECTION'], axis=1, inplace=True)

# Merging climate data to main dataset
one_h_data = one_h_data.merge(sd_data, how='left', left_index=True, right_index=True)
one_h_data = one_h_data.merge(temp_data, how='left', left_index=True, right_index=True)
one_h_data = one_h_data.merge(rain_data, how='left', left_index=True, right_index=True)
one_h_data = one_h_data.merge(wind_data, how='left', left_index=True, right_index=True)

# Deleting data that is not used anymore to free RAM
del [wind_data, wind_path, rain_data, rain_path, temp_data, temp_path, sd_data, sd_path]

Imputation of missing climate data

Now that the data is in a reasonable format and contains additional climate features, the completeness of the set will be examined. After that missing values will be imputed using appropriate methods

In [19]:
one_h_data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 17544 entries, 2015-01-01 00:00:00 to 2016-12-31 23:00:00
Freq: H
Data columns (total 7 columns):
BOOKING_HAL_ID                 17544 non-null int64
SUNSHINE_DURATION              13133 non-null float64
TWO_METER_AIR_TEMP             17538 non-null float64
TWO_METER_HUMIDITY             17538 non-null float64
HOURLY_PRECIPITATION_HEIGHT    17539 non-null float64
RAIN_FALLEN                    17539 non-null float64
MEAN_WIND_SPEED                17443 non-null float64
dtypes: float64(6), int64(1)
memory usage: 1.7 MB

From the above it can be seen the climate data contains missing values. Especially the feature SUNSHINE_DURATION is not complete. One reason for this is that at night, where sunshine duration is not measurable, the DWD does record missing values. These missing values can therefore be logically filled with 0's since sun is not shining at night times.

The rest of the features contain an equal amount of missing values all at in the same time frame. Consequently these missing values could not be estimated using methods like regression. Therefore the build-in method time is chosen which linearly interpolates from the last non-missing observation to the next available non-missing observation

In [20]:
# Replacing the code for missing values (-999) of DWD to nan
for column in one_h_data.columns[-6:]:
    one_h_data[column].replace(-999.0, np.nan, inplace=True)

# Extracting the hour of the date and casting it to type int    
one_h_data['HOUR'] = pd.to_datetime(one_h_data.index).strftime('%-H')
one_h_data['HOUR'] = one_h_data['HOUR'].apply(lambda x: int(x))

# Imputing missing values of SUNSHINE_DURATION at night with zero and drop column HOUR
one_h_data.loc[
    one_h_data.SUNSHINE_DURATION.isna() & ((one_h_data.HOUR <= 3) | (one_h_data.HOUR >= 23)), ['SUNSHINE_DURATION']] = 0
del one_h_data['HOUR']

# Impute the rest of missing values using method time
for column in one_h_data.columns[-6:]:
    one_h_data[column].interpolate(method='time', inplace=True)

Holidays

Since rental demand can vary by working and non-working day in the following school as well as bank holiday information is added to the data as a binary feature for each. Information about bank holidays is obtained using the package holidays which allows to draw the dates which are bank holidays of a specific country and region. School holidays are set manually specifying the start and end date of a specific school-free time.

Bank Holidays

In [21]:
# Importing german holidays of province hessen
ger_holidays = holidays.CountryHoliday('DE', prov='HE')

# Creating individual lists for year, month and day of month on the basis of the data
year_int = one_h_data.index.strftime('%Y').astype(int)
month_int = one_h_data.index.strftime('%m').astype(int)
day_int = one_h_data.index.strftime('%d').astype(int)


def get_holiday(dyear, dmonth, dday):
    """
    Function to check wether a certain year, month, day of month combination is a bank holiday 
    
    :param dyear: year as an integer
    :param dmonth: month as an integer
    :param dday: day of the month as an integer
    :return: 1 if combination is on a bank holiday, 0 otherwise
    """
    is_holiday = holidays.date(dyear, dmonth, dday) in ger_holidays
    return is_holiday

# Application of get_holiday to the previously specified combinations
holidayList = list(map(get_holiday, year_int, month_int, day_int))

# Assingning the list of holidays to individual column
one_h_data['BANK_HOLIDAY'] = holidayList

# Converting bool to int
one_h_data['BANK_HOLIDAY'] = one_h_data['BANK_HOLIDAY'].apply(lambda x: int(x))

School Holidays

In [22]:
# School holidays 2015
one_h_data.loc[
    (one_h_data.index >= '2015-01-01') & (one_h_data.index <= '2015-01-11 23:59:59'), 'SCHOOL_HOLIDAY'] = 1
one_h_data.loc[
    (one_h_data.index >= '2015-03-30') & (one_h_data.index <= '2015-04-12 23:59:59'), 'SCHOOL_HOLIDAY'] = 1
one_h_data.loc[
    (one_h_data.index >= '2015-07-27') & (one_h_data.index <= '2015-09-06 23:59:59'), 'SCHOOL_HOLIDAY'] = 1
one_h_data.loc[
    (one_h_data.index >= '2015-10-19') & (one_h_data.index <= '2015-10-31 23:59:59'), 'SCHOOL_HOLIDAY'] = 1
one_h_data.loc[
    (one_h_data.index >= '2015-12-23') & (one_h_data.index <= '2015-12-31 23:59:59'), 'SCHOOL_HOLIDAY'] = 1

# School holidays 2016
one_h_data.loc[
    (one_h_data.index >= '2016-01-01') & (one_h_data.index <= '2016-01-10 23:59:59'), 'SCHOOL_HOLIDAY'] = 1
one_h_data.loc[
    (one_h_data.index >= '2016-03-29') & (one_h_data.index <= '2016-04-10 23:59:59'), 'SCHOOL_HOLIDAY'] = 1
one_h_data.loc[
    (one_h_data.index >= '2016-07-18') & (one_h_data.index <= '2016-08-28 23:59:59'), 'SCHOOL_HOLIDAY'] = 1
one_h_data.loc[
    (one_h_data.index >= '2016-10-17') & (one_h_data.index <= '2016-10-30 23:59:59'), 'SCHOOL_HOLIDAY'] = 1
one_h_data.loc[
    (one_h_data.index >= '2016-12-22') & (one_h_data.index <= '2016-12-31 23:59:59'), 'SCHOOL_HOLIDAY'] = 1

# Filling days which are not in school holidays with 0
one_h_data.loc[one_h_data.SCHOOL_HOLIDAY.isnull(), 'SCHOOL_HOLIDAY'] = 0

# Renaming columns for later analysis
one_h_data.rename(columns={'BOOKING_HAL_ID': 'COUNT', 'TWO_METER_AIR_TEMP': 'AIR_TEMP', 
                           'TWO_METER_HUMIDITY': 'HUMIDITY','HOURLY_PRECIPITATION_HEIGHT': 'PRECIPITATION_HEIGHT'},
                  inplace=True)

Resampling to 2H, 6H, 24H Resolutions

Since now all additional features are contained in the data, the data will be re-sampled to two, six and twenty four hour resolutions. The features will be aggregated in regard to their type of recording i.e. means will be aggregated using their means, counts or normal measurements will be summed and binary features will stay binary encoded.

In [23]:
# Resampling to two hour resolution
two_h_data = one_h_data.resample('2H').agg({'COUNT': np.sum,
                                            'SUNSHINE_DURATION': np.sum,
                                            'AIR_TEMP': np.mean,
                                            'HUMIDITY': np.mean,
                                            'PRECIPITATION_HEIGHT': np.sum,
                                            'RAIN_FALLEN': lambda x: x.nunique() - 1,
                                            'MEAN_WIND_SPEED': np.mean,
                                            'BANK_HOLIDAY': np.mean,
                                            'SCHOOL_HOLIDAY': np.mean})

# Resampling to six hour resolution
six_h_data = one_h_data.resample('6H').agg({'COUNT': np.sum,
                                            'SUNSHINE_DURATION': np.sum,
                                            'AIR_TEMP': np.mean,
                                            'HUMIDITY': np.mean,
                                            'PRECIPITATION_HEIGHT': np.sum,
                                            'RAIN_FALLEN': lambda x: x.nunique() - 1,
                                            'MEAN_WIND_SPEED': np.mean,
                                            'BANK_HOLIDAY': np.mean,
                                            'SCHOOL_HOLIDAY': np.mean})
# Resampling to 24 hour resolution
twentyfour_h_data = one_h_data.resample('1D').agg({'COUNT': np.sum,
                                                   'SUNSHINE_DURATION': np.sum,
                                                   'AIR_TEMP': np.mean,
                                                   'HUMIDITY': np.mean,
                                                   'PRECIPITATION_HEIGHT': np.sum,
                                                   'RAIN_FALLEN': lambda x: x.nunique() - 1,
                                                   'MEAN_WIND_SPEED': np.mean,
                                                   'BANK_HOLIDAY': np.mean,
                                                   'SCHOOL_HOLIDAY': np.mean})

Prediction Data Description

Column dtype Description
COUNT int64 The number of bikes rented in the next period.
SUNSHINE_DURATION float64 The amount of sun shining in the next period in minutes.
AIR_TEMP float64 The mean temperature at two meters above ground in the next period in degress.
HUMIDITY float64 The mean humidity at two meters above ground in the next period in percent.
PRECIPITATION_HEIGHT float64 The amount precipitation in the next period in mm.
RAIN_FALLEN float64 Binary variable indicating whether rain has fallen in the next period.
MEAN_WIND_SPEED float64 The mean wind speed in the next period in kmph.
BANK_HOLIDAY int64 Binary variable indicating whether there is a bank holiday in the next period.
SCHOOL_HOLIDAY float64 Binary variable indicating whether there is a school holiday in the next period.

Descriptive and Explorative Analysis

Inspecting the Stations

delete the next two cells if there is no important information maybe write about missing values in some columns otherwise drop

In [24]:
cluster_data.head()
Out[24]:
BOOKING_HAL_ID VEHICLE_HAL_ID CUSTOMER_HAL_ID DATE_FROM DATE_UNTIL START_RENTAL_ZONE START_RENTAL_ZONE_HAL_ID END_RENTAL_ZONE END_RENTAL_ZONE_HAL_ID START_LONGITUDE START_LATITUDE END_LONGITUDE END_LATITUDE DISTANCE BOOKING_DURATION
0 27689266 117157 850FFC361BA66AC4918B50DAD1CE90893ACDBC83 2015-01-01 01:29:04 2015-01-01 01:29:39 Klinikum Kassel / Mönchebergstraße 204820 Klinikum Kassel / Mönchebergstraße 204820 9.509167 51.327222 9.509167 51.327222 0.000000 0.583333
1 27689762 116780 C793F8DF508C323CC19FC3DA031B12FFE26861E9 2015-01-01 03:17:55 2015-01-01 03:26:56 Schwimmbad Auedamm 204782 Uni-Kassel / Menzelstr. 204785 9.499444 51.300278 9.486944 51.303611 0.947364 9.016667
2 27689082 117149 36AB6ABBC0733C2B96BBC604FD027EEF11E9A692 2015-01-01 00:43:22 2015-01-01 02:26:09 Karthäuser Str. / FES 204803 Karthäuser Str. / FES 204803 9.485278 51.314444 9.485278 51.314444 0.000000 102.783333
3 27689296 117325 FC2E123D74CC42F02CA74668EA441CEA4BA73708 2015-01-01 01:38:58 2015-01-01 01:49:40 Haltestelle Kirchweg/Wilhelmshöher Allee 204792 Auestadion 204781 9.465000 51.311944 9.481111 51.298611 1.860846 10.700000
4 27689735 108208 7CBCDCAAEFF82B086D96F1497F059EB540ED3CCE 2015-01-01 03:11:55 2015-01-01 03:14:17 Georg-Stock-Platz / Kohlenstraße 204788 Georg-Stock-Platz / Kohlenstraße 204788 9.468056 51.309167 9.468056 51.309167 0.000000 2.366667
In [25]:
cluster_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 351452 entries, 0 to 351463
Data columns (total 15 columns):
BOOKING_HAL_ID              351452 non-null int64
VEHICLE_HAL_ID              351452 non-null int64
CUSTOMER_HAL_ID             351452 non-null object
DATE_FROM                   351452 non-null object
DATE_UNTIL                  351452 non-null object
START_RENTAL_ZONE           351452 non-null object
START_RENTAL_ZONE_HAL_ID    351452 non-null int64
END_RENTAL_ZONE             351452 non-null object
END_RENTAL_ZONE_HAL_ID      351452 non-null int64
START_LONGITUDE             349102 non-null float64
START_LATITUDE              349102 non-null float64
END_LONGITUDE               349423 non-null float64
END_LATITUDE                349423 non-null float64
DISTANCE                    347261 non-null float64
BOOKING_DURATION            351452 non-null float64
dtypes: float64(6), int64(4), object(5)
memory usage: 42.9+ MB

For an overview of the number of started bookings for all rental zones, the dataframe start_stations is created. It groups the rental zones by aggregating the sum of total departure bookings for a period of two years.

In [26]:
# Creating a dataframe, aggregating the sum of total departures by grouping the rental zones.
start_stations = cluster_data[['START_RENTAL_ZONE', 'START_LATITUDE', 'START_LONGITUDE', 'BOOKING_HAL_ID']].groupby(['START_RENTAL_ZONE', 'START_LATITUDE', 'START_LONGITUDE'])["BOOKING_HAL_ID"].count().reset_index(name="TOTAL_DEPARTURES")
start_stations.dropna(axis=0, inplace=True)    
start_stations.head()
Out[26]:
START_RENTAL_ZONE START_LATITUDE START_LONGITUDE TOTAL_DEPARTURES
0 Alt Waldau / Nürnberger Straße 51.289950 9.515127 1289
1 Am Heimbach / Kohlenstraße 51.308043 9.453924 4185
2 Annastraße / Friedrich-Ebert-Str. 51.315278 9.477500 2686
3 Auestadion 51.298611 9.481111 13184
4 Brandgasse / Altenbaunaer Straße 51.274521 9.448309 745

Following, a map of start rental zones in Kassel. It is sufficient to map only the starting rental zones, because any rental zone where no booking has been made within the two-year period is unlikely to be of high relevance.

In the first iteration of the data visualization, the rental zones were mapped in the Indian Ocean. It was therefore clear that it is necessary to switch the column names of LONGITUDE and LATITUDE.

In [27]:
# Building map of total departures for rental zones
m = folium.Map(
    location=[51.312013, 9.481551],
    tiles='OpenStreetMap',
    zoom_start=12
)
# Add markers for each rental zones
tooltip = 'Click me!'
for index, row in start_stations.iterrows():
    folium.Marker([row['START_LATITUDE'], row['START_LONGITUDE']], radius=10,popup=row['START_RENTAL_ZONE'], tooltip='Click to see the station name').add_to(m)
Out[27]:
<folium.map.Marker at 0x1a27671d0>
Out[27]:
<folium.map.Marker at 0x1994ed9e8>
Out[27]:
<folium.map.Marker at 0x12b0b6da0>
Out[27]:
<folium.map.Marker at 0x12b0b6e48>
Out[27]:
<folium.map.Marker at 0x12b0b6cf8>
Out[27]:
<folium.map.Marker at 0x1a2767278>
Out[27]:
<folium.map.Marker at 0x18a205e48>
Out[27]:
<folium.map.Marker at 0x18a205d68>
Out[27]:
<folium.map.Marker at 0x18a205b70>
Out[27]:
<folium.map.Marker at 0x18a2059e8>
Out[27]:
<folium.map.Marker at 0x18a205d30>
Out[27]:
<folium.map.Marker at 0x18a2056d8>
Out[27]:
<folium.map.Marker at 0x18a205550>
Out[27]:
<folium.map.Marker at 0x18a2053c8>
Out[27]:
<folium.map.Marker at 0x18a205240>
Out[27]:
<folium.map.Marker at 0x18a205400>
Out[27]:
<folium.map.Marker at 0x185c2c048>
Out[27]:
<folium.map.Marker at 0x185c2c2b0>
Out[27]:
<folium.map.Marker at 0x185c2c438>
Out[27]:
<folium.map.Marker at 0x185c2c5c0>
Out[27]:
<folium.map.Marker at 0x185c2c748>
Out[27]:
<folium.map.Marker at 0x185c2c908>
Out[27]:
<folium.map.Marker at 0x185c2ca58>
Out[27]:
<folium.map.Marker at 0x185c2cbe0>
Out[27]:
<folium.map.Marker at 0x185c2cd68>
Out[27]:
<folium.map.Marker at 0x185c2cf28>
Out[27]:
<folium.map.Marker at 0x185c2cfd0>
Out[27]:
<folium.map.Marker at 0x185c2cef0>
Out[27]:
<folium.map.Marker at 0x177e03080>
Out[27]:
<folium.map.Marker at 0x177e03550>
Out[27]:
<folium.map.Marker at 0x177e03668>
Out[27]:
<folium.map.Marker at 0x177e03860>
Out[27]:
<folium.map.Marker at 0x177e03a20>
Out[27]:
<folium.map.Marker at 0x177e03b70>
Out[27]:
<folium.map.Marker at 0x177e03c88>
Out[27]:
<folium.map.Marker at 0x177e03e80>
Out[27]:
<folium.map.Marker at 0x177e03f60>
Out[27]:
<folium.map.Marker at 0x18f9301d0>
Out[27]:
<folium.map.Marker at 0x18f930208>
Out[27]:
<folium.map.Marker at 0x18f930358>
Out[27]:
<folium.map.Marker at 0x18f930390>
Out[27]:
<folium.map.Marker at 0x18f9307f0>
Out[27]:
<folium.map.Marker at 0x18f930908>
Out[27]:
<folium.map.Marker at 0x18f930b00>
Out[27]:
<folium.map.Marker at 0x18f930cc0>
Out[27]:
<folium.map.Marker at 0x18f9304e0>
Out[27]:
<folium.map.Marker at 0x18f930f98>
Out[27]:
<folium.map.Marker at 0x18f930da0>
Out[27]:
<folium.map.Marker at 0x192eab2e8>
Out[27]:
<folium.map.Marker at 0x192eab470>
Out[27]:
<folium.map.Marker at 0x192eab518>
Out[27]:
<folium.map.Marker at 0x192eab780>
Out[27]:
<folium.map.Marker at 0x192eab940>
Out[27]:
<folium.map.Marker at 0x192eaba90>
Out[27]:
<folium.map.Marker at 0x192eabc50>
Out[27]:
<folium.map.Marker at 0x192eabda0>
Out[27]:
<folium.map.Marker at 0x192eabf28>
Out[27]:
<folium.map.Marker at 0x192eabeb8>
In [28]:
m
Out[28]:

Analogue to the start_stations there is an end_stations dataframe created for an overview of the number of ended bookings for all rental zones. It groups the rental zones by aggregating the sum of total arrival bookings for a period of two years.

In [29]:
# Creating a dataframe, aggregating the sum of total arrivals by grouping the rental zones.
end_stations = cluster_data[['END_RENTAL_ZONE', 'END_LATITUDE', 'END_LONGITUDE', 'BOOKING_HAL_ID']].groupby(['END_RENTAL_ZONE', 'END_LATITUDE', 'END_LONGITUDE'])['BOOKING_HAL_ID'].count().reset_index(name='TOTAL_ARRIVALS')
end_stations.dropna(axis=0, inplace=True)
end_stations.head()
Out[29]:
END_RENTAL_ZONE END_LATITUDE END_LONGITUDE TOTAL_ARRIVALS
0 Alt Waldau / Nürnberger Straße 51.289950 9.515127 1832
1 Am Heimbach / Kohlenstraße 51.308043 9.453924 3392
2 Annastraße / Friedrich-Ebert-Str. 51.315278 9.477500 1641
3 Auestadion 51.298611 9.481111 19442
4 Brandgasse / Altenbaunaer Straße 51.274521 9.448309 853

Rental Zone Patterns over Time

Arrivals

The following heatmap shows the previously aggregated values of total arrivals over the two-year period per rental station from the dataframe end_stations.

In [30]:
# Creating heatmap of arrivals per rental zone
fig = px.scatter_mapbox(end_stations, lat="END_LATITUDE", lon="END_LONGITUDE", hover_name="END_RENTAL_ZONE", hover_data=["TOTAL_ARRIVALS"],
                        color="TOTAL_ARRIVALS", size="TOTAL_ARRIVALS", color_continuous_scale=[[0, 'rgba(150, 150, 255, 0.85)'], 
               [1, 'rgba(0,0,255, 0.85)']], zoom=11.5, height=400)
fig = fig.update_layout(mapbox_style="open-street-map")
fig = fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

After an analysis of the map, it is apperent that the rental zones with the most arrivals (where the most bookings have ended) are close to or are at university locations.

Departures

The following heatmap shows the previously aggregated values of total departures over the two-year period per rental station from the dataframe start_stations.

In [32]:
# Creating heatmap of departures per rental zone
fig = px.scatter_mapbox(start_stations, lat="START_LATITUDE", lon="START_LONGITUDE", hover_name="START_RENTAL_ZONE", hover_data=["TOTAL_DEPARTURES"],
                        color="TOTAL_DEPARTURES", size="TOTAL_DEPARTURES", color_continuous_scale=[[0, 'rgba(255, 180, 180, 0.85)'], 
               [1, 'rgba(255,0,0, 0.85)']], zoom=11.5, height=400)
fig = fig.update_layout(mapbox_style="open-street-map")
fig = fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

After an analysis of the map, it is apperent that the most bikes were rented at the rental zone "Uni-Kassel / Manzelstr.". Also, what is noticebale, is that many bikes are rented in the proximity of train stations or university locations.

To show a surplus or deficiency of available bikes at the rental zones, the net depature of each rental zone is calculated by subtracting total depatures from total arrivals.

In [33]:
# Calcuation of net depature (= arrivals - depatures)
net_departures = pd.merge(start_stations[['START_RENTAL_ZONE', 'TOTAL_DEPARTURES']], end_stations, left_on="START_RENTAL_ZONE", right_on="END_RENTAL_ZONE", how='left')
net_departures['NET_DEPARTURE'] = net_departures['TOTAL_ARRIVALS'] - net_departures['TOTAL_DEPARTURES']
net_departures.rename(columns={'START_RENTAL_ZONE':'RENTAL_ZONE', 'END_LONGITUDE':'LONGITUDE', 'END_LATITUDE':'LATITUDE'}, inplace=True)
net_departures = net_departures[['RENTAL_ZONE', 'LATITUDE', 'LONGITUDE', 'TOTAL_DEPARTURES', 'TOTAL_ARRIVALS', 'NET_DEPARTURE']]
net_departures.head()
Out[33]:
RENTAL_ZONE LATITUDE LONGITUDE TOTAL_DEPARTURES TOTAL_ARRIVALS NET_DEPARTURE
0 Alt Waldau / Nürnberger Straße 51.289950 9.515127 1289 1832 543
1 Am Heimbach / Kohlenstraße 51.308043 9.453924 4185 3392 -793
2 Annastraße / Friedrich-Ebert-Str. 51.315278 9.477500 2686 1641 -1045
3 Auestadion 51.298611 9.481111 13184 19442 6258
4 Brandgasse / Altenbaunaer Straße 51.274521 9.448309 745 853 108

The heatmap plotted below shows the calculated net depatures for each rental zone. If there is a deficiency in available bikes at a given rental zone (negative net depature), the color of the marker is red. Contrarily, if there is a surplus in avaliable bikes at a given rental zone (positive net depature), the color of the marker is blue.

In [34]:
# Creapting heatmap of net departures per rental zone
fig = px.scatter_mapbox(net_departures, lat="LATITUDE", lon="LONGITUDE", hover_name="RENTAL_ZONE", hover_data=["NET_DEPARTURE"],
                        color="NET_DEPARTURE", size=net_departures["NET_DEPARTURE"].abs(), color_continuous_scale=[[0, 'rgba(255, 0, 0, 0.85)'], [abs(min(net_departures['NET_DEPARTURE'])/(abs(min(net_departures['NET_DEPARTURE']))+max(net_departures['NET_DEPARTURE']))), 'rgba(255, 255, 255, 0.85)'], [1, 'rgba(0,0,255, 0.85)']], zoom=11.5, height=400)
fig = fig.update_layout(mapbox_style="open-street-map")
fig = fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

The map reveals that there is a net shift of bikes from train stations in the western part of the city to university locations in the eastern part of the city. It is also salient that the two rental zones "ICE Bahnhof / Willy-Brandt-Platz" and the "Hauptbahnhof / Rainer-Dierichs-Platz" have particularly high negative net values. In contrast the rental zones "Uni-Kassel / Menzelstr." and "Weserspitze / Weserstraße" have particularly high positive net values.

Weekly Demand Patterns

To examine if there is a shift from a certain group of rental zones to another group of rental zones in the course of a week (e.g. higher demands in the outscirts on the weekend), an animated plot of rentals per rental zone and weekday is needed. To create an animation of bookings on a weekday-basis, there is a column with the accoring weekday needed. It is deduced from the DATE_FROM column. The dataframe is then grouped by the rental zone and the weekday and is aggregated with the sum of bookings. The results of the counts of bookings for each rental zone is then shown in a map, animated for every weekday.

In [35]:
# Creating dataframe weekly_animation from copy of cluster_data and add WEEKDAY column
weekly_animation = cluster_data.copy()
weekly_animation['WEEKDAY'] = pd.to_datetime(cluster_data['DATE_FROM']).apply(lambda x: x.weekday())
weekly_animation.head()
Out[35]:
BOOKING_HAL_ID VEHICLE_HAL_ID CUSTOMER_HAL_ID DATE_FROM DATE_UNTIL START_RENTAL_ZONE START_RENTAL_ZONE_HAL_ID END_RENTAL_ZONE END_RENTAL_ZONE_HAL_ID START_LONGITUDE START_LATITUDE END_LONGITUDE END_LATITUDE DISTANCE BOOKING_DURATION WEEKDAY
0 27689266 117157 850FFC361BA66AC4918B50DAD1CE90893ACDBC83 2015-01-01 01:29:04 2015-01-01 01:29:39 Klinikum Kassel / Mönchebergstraße 204820 Klinikum Kassel / Mönchebergstraße 204820 9.509167 51.327222 9.509167 51.327222 0.000000 0.583333 3
1 27689762 116780 C793F8DF508C323CC19FC3DA031B12FFE26861E9 2015-01-01 03:17:55 2015-01-01 03:26:56 Schwimmbad Auedamm 204782 Uni-Kassel / Menzelstr. 204785 9.499444 51.300278 9.486944 51.303611 0.947364 9.016667 3
2 27689082 117149 36AB6ABBC0733C2B96BBC604FD027EEF11E9A692 2015-01-01 00:43:22 2015-01-01 02:26:09 Karthäuser Str. / FES 204803 Karthäuser Str. / FES 204803 9.485278 51.314444 9.485278 51.314444 0.000000 102.783333 3
3 27689296 117325 FC2E123D74CC42F02CA74668EA441CEA4BA73708 2015-01-01 01:38:58 2015-01-01 01:49:40 Haltestelle Kirchweg/Wilhelmshöher Allee 204792 Auestadion 204781 9.465000 51.311944 9.481111 51.298611 1.860846 10.700000 3
4 27689735 108208 7CBCDCAAEFF82B086D96F1497F059EB540ED3CCE 2015-01-01 03:11:55 2015-01-01 03:14:17 Georg-Stock-Platz / Kohlenstraße 204788 Georg-Stock-Platz / Kohlenstraße 204788 9.468056 51.309167 9.468056 51.309167 0.000000 2.366667 3
In [36]:
# Grouping by START_LAT, START_LONG, NAME, WEEKDAY
weekly_animation_group = weekly_animation.groupby(['START_LATITUDE', 'START_LONGITUDE', 'START_RENTAL_ZONE', 'WEEKDAY']).count()
weekly_animation_group = weekly_animation_group.add_suffix('_COUNT').reset_index().sort_values(by='WEEKDAY')
weekly_animation_group.head()
Out[36]:
START_LATITUDE START_LONGITUDE START_RENTAL_ZONE WEEKDAY BOOKING_HAL_ID_COUNT VEHICLE_HAL_ID_COUNT CUSTOMER_HAL_ID_COUNT DATE_FROM_COUNT DATE_UNTIL_COUNT START_RENTAL_ZONE_HAL_ID_COUNT END_RENTAL_ZONE_COUNT END_RENTAL_ZONE_HAL_ID_COUNT END_LONGITUDE_COUNT END_LATITUDE_COUNT DISTANCE_COUNT BOOKING_DURATION_COUNT
0 51.274521 9.448309 Brandgasse / Altenbaunaer Straße 0 106 106 106 106 106 106 106 106 105 105 105 106
371 51.338563 9.493329 Hegelsbergstraße / Bunsenstraße 0 699 699 699 699 699 699 699 699 699 699 699 699
238 51.315000 9.501667 Markthalle / Graben 0 501 501 501 501 501 501 501 501 501 501 501 501
84 51.304722 9.443889 Marbachshöhe / Marie-Calm-Straße 0 300 300 300 300 300 300 300 300 293 293 293 300
392 51.340000 9.480278 Holländische Str. / KVG 0 175 175 175 175 175 175 175 175 175 175 175 175
In [37]:
# To cover the time aspect the following animated map reveals the rental zone load on a shifting time frame per weekday
fig = px.scatter_mapbox(weekly_animation_group, lat="START_LATITUDE", lon="START_LONGITUDE", hover_name="START_RENTAL_ZONE", size="CUSTOMER_HAL_ID_COUNT",
               animation_frame="WEEKDAY", zoom=11.8, height=600)
fig = fig.update_layout(mapbox_style="open-street-map")
fig = fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

Unfortunately it is not possible to observe a significant change in demand patterns for the different weekdays. The visualization only shows that the highest demand is between Thuesday and Thursday. On the weekend there is a lower demand of bikes.

Hourly Demand Visualization

To examine if there is a shift from a certain group of rental zones to another group of rental zones in the course of a day (e.g. from the outscirts to the inner city in the morining), an animated plot of rentals per rental zone and hour of day is needed. To create an animation of bookings on a hourly-basis, there is a column with the accoring hour of day needed. It is deduced from the DATE_FROM column. The dataframe is then grouped by the rental zone and the hour of day and is aggregated with the sum of bookings. The results of the counts of bookings for each rental zone is then shown in a map, animated for every hour of day.

In [38]:
# Creating dataframe hourly-animation from copy of cluster_data and add the hour of day
hourly_animation = cluster_data.copy()
hourly_animation['HOUR'] = pd.to_datetime(cluster_data['DATE_FROM']).apply(lambda x: x.hour)
hourly_animation.head()
Out[38]:
BOOKING_HAL_ID VEHICLE_HAL_ID CUSTOMER_HAL_ID DATE_FROM DATE_UNTIL START_RENTAL_ZONE START_RENTAL_ZONE_HAL_ID END_RENTAL_ZONE END_RENTAL_ZONE_HAL_ID START_LONGITUDE START_LATITUDE END_LONGITUDE END_LATITUDE DISTANCE BOOKING_DURATION HOUR
0 27689266 117157 850FFC361BA66AC4918B50DAD1CE90893ACDBC83 2015-01-01 01:29:04 2015-01-01 01:29:39 Klinikum Kassel / Mönchebergstraße 204820 Klinikum Kassel / Mönchebergstraße 204820 9.509167 51.327222 9.509167 51.327222 0.000000 0.583333 1
1 27689762 116780 C793F8DF508C323CC19FC3DA031B12FFE26861E9 2015-01-01 03:17:55 2015-01-01 03:26:56 Schwimmbad Auedamm 204782 Uni-Kassel / Menzelstr. 204785 9.499444 51.300278 9.486944 51.303611 0.947364 9.016667 3
2 27689082 117149 36AB6ABBC0733C2B96BBC604FD027EEF11E9A692 2015-01-01 00:43:22 2015-01-01 02:26:09 Karthäuser Str. / FES 204803 Karthäuser Str. / FES 204803 9.485278 51.314444 9.485278 51.314444 0.000000 102.783333 0
3 27689296 117325 FC2E123D74CC42F02CA74668EA441CEA4BA73708 2015-01-01 01:38:58 2015-01-01 01:49:40 Haltestelle Kirchweg/Wilhelmshöher Allee 204792 Auestadion 204781 9.465000 51.311944 9.481111 51.298611 1.860846 10.700000 1
4 27689735 108208 7CBCDCAAEFF82B086D96F1497F059EB540ED3CCE 2015-01-01 03:11:55 2015-01-01 03:14:17 Georg-Stock-Platz / Kohlenstraße 204788 Georg-Stock-Platz / Kohlenstraße 204788 9.468056 51.309167 9.468056 51.309167 0.000000 2.366667 3
In [39]:
# Grouping by START_LAT, START_LONG, NAME, HOUR

hourly_animation_group = hourly_animation.groupby(['START_LATITUDE', 'START_LONGITUDE', 'START_RENTAL_ZONE', 'HOUR']).count()
hourly_animation_group = hourly_animation_group.add_suffix('_COUNT').reset_index().sort_values(by='HOUR')
hourly_animation_group.head()
Out[39]:
START_LATITUDE START_LONGITUDE START_RENTAL_ZONE HOUR BOOKING_HAL_ID_COUNT VEHICLE_HAL_ID_COUNT CUSTOMER_HAL_ID_COUNT DATE_FROM_COUNT DATE_UNTIL_COUNT START_RENTAL_ZONE_HAL_ID_COUNT END_RENTAL_ZONE_COUNT END_RENTAL_ZONE_HAL_ID_COUNT END_LONGITUDE_COUNT END_LATITUDE_COUNT DISTANCE_COUNT BOOKING_DURATION_COUNT
0 51.274521 9.448309 Brandgasse / Altenbaunaer Straße 0 22 22 22 22 22 22 22 22 22 22 22 22
668 51.313333 9.494444 Friedrichsplatz 0 501 501 501 501 501 501 501 501 497 497 497 501
644 51.313056 9.536389 Salzmannshausen / SMA 0 55 55 55 55 55 55 55 55 55 55 55 55
620 51.313056 9.508611 Unterneustadt 0 723 723 723 723 723 723 723 723 723 723 723 723
596 51.312778 9.445833 ICE Bahnhof / Willy-Brandt-Platz 0 429 429 429 429 429 429 429 429 425 425 425 429
In [40]:
# To cover the time aspect the following animated map reveals the rental zone load on a shifting time frame per hour of a day
fig = px.scatter_mapbox(hourly_animation_group, lat="START_LATITUDE", lon="START_LONGITUDE", hover_name="START_RENTAL_ZONE", size="CUSTOMER_HAL_ID_COUNT",
               animation_frame="HOUR", zoom=11.8, height=600)
fig = fig.update_layout(mapbox_style="open-street-map")
fig = fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

This visualization shows a peak demand between 7am and 7pm. The lowest demand is shown between 1am and 7am with a minmun at 6am. It is observable that the demand for the various rental zones increase and shrink at a similar rate in the hour.

Clustering of Stations by Popularity and Time

In addition to the pure rental station pattern visualization it is also possible to cluster the stations based on time and popularity of the specific station. Time in this case will be treated on a daily basis. Therefore a HOUR_OF_DAY column is appended to the clustering dataset containing the hour of day from 0 to 24.

In [41]:
# Add HOUR_OF_DAY column to cluster_data
cluster_data['HOUR_OF_DAY'] = pd.to_datetime(cluster_data['DATE_FROM']).dt.hour

In order to identify different patterns depending on the start and end rental zones, the following code groups the dataframe according to the start and respecitvely end rental zone by counting the number of bookings at each station for each hour of the day. The booking count represents the metric for popularity in the following.

In [42]:
# Create respective dataframes for the datasets in order to cluster by HOUR_OF_DAY and COUNT of bookings
booking_start_counts = cluster_data.groupby(['START_LATITUDE', 'START_LONGITUDE', 'HOUR_OF_DAY']).size().reset_index(name="COUNT")
booking_end_counts = cluster_data.groupby(['END_LATITUDE', 'END_LONGITUDE', 'HOUR_OF_DAY']).size().reset_index(name="COUNT")
booking_start_counts.head()
Out[42]:
START_LATITUDE START_LONGITUDE HOUR_OF_DAY COUNT
0 51.274521 9.448309 0 22
1 51.274521 9.448309 1 12
2 51.274521 9.448309 2 20
3 51.274521 9.448309 3 11
4 51.274521 9.448309 4 8
In [43]:
# Define functions for later use
def map_hour_to_cyclic_time(hour):
    """
    Maps hours to integer values in such a way that afternoon and noon have the same distance to the night.
    This is required for k-means to work properly to deliver reasonably results because it depends on the euclidean metric.
    """
    if (hour >= 5) and (hour <= 13):
        return 1
    elif (hour >= 14) and (hour <= 21):
        return 3
    return 2

def plot_elbow_graph(df):
    """
    The elbow method is used to visualize the cluster losses of k-means obtained by the model.inertia_ 
    """
    clusters = []
    losses = []
    for k in range(10):
        model = KMeans(n_clusters=k + 1, random_state=1337, init='k-means++')
        model.fit(df)
        clusters.append(k + 1)
        losses.append(model.inertia_)

    plt.plot(clusters, losses)
    plt.show()

K-means is very sensitive to outliers. HOUR_OF_DAY and COUNT must be scaled properly for the best results. Sklearn's StandardScaler is used here.

In [44]:
# Scale values and plot elbow graph
scaler = StandardScaler()
booking_start_counts_scaled = booking_start_counts.copy()
booking_start_counts_scaled['HOUR_OF_DAY'] = booking_start_counts_scaled['HOUR_OF_DAY'].apply(map_hour_to_cyclic_time)
booking_start_counts_scaled[['COUNT', 'HOUR_OF_DAY']] = scaler.fit_transform(booking_start_counts_scaled[['COUNT', 'HOUR_OF_DAY']])
booking_end_counts_scaled = booking_end_counts.copy()
booking_end_counts_scaled['HOUR_OF_DAY'] = booking_end_counts_scaled['HOUR_OF_DAY'].apply(map_hour_to_cyclic_time)
booking_end_counts_scaled[['COUNT', 'HOUR_OF_DAY']] = scaler.fit_transform(booking_end_counts_scaled[['COUNT', 'HOUR_OF_DAY']])
plot_elbow_graph(booking_start_counts_scaled[['COUNT', 'HOUR_OF_DAY']])

Important to note is that the k-means++ algorithm instead of the "normal" one is used to fit the model. The advantage of the k-means++ is that it is not sensitive to find local optima because it selects the cluster centroids for the defined number of clusters randomly. This decreases the probability of finding a local optimum dramatically.

In [45]:
# Fit a k-means++ model with the booking start counts
n_clusters = 4
model = KMeans(n_clusters=n_clusters, random_state=1337, init='k-means++')
model.fit(booking_start_counts_scaled[['COUNT', 'HOUR_OF_DAY']])
Out[45]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=4, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=1337, tol=0.0001, verbose=0)
In [46]:
# Plot clustered start rental zone bookings
clustered_booking_start_counts = booking_start_counts.copy()
# Attach the cluster result to the unscaled dataframe for meaningful plotting
clustered_booking_start_counts['CLUSTER'] = model.predict(booking_start_counts_scaled[['COUNT', 'HOUR_OF_DAY']])
sns.scatterplot(x='HOUR_OF_DAY', y='COUNT', data=clustered_booking_start_counts, hue='CLUSTER').set_title('Clustered start rental zone bookings')
plt.show()
Out[46]:
Text(0.5, 1.0, 'Clustered start rental zone bookings')

The cluster 0 is both, in the right and the left of the plot, because the distance has been mapped with the map_hour_to_cyclic_time function.

Additionally, it is important to note that the end rental zones are clustered with the start rental zone model to have a good comparison between the two results.

In [47]:
# Plot clustered end rental zone bookings (clustered with the start rental zone model for comparative power)
clustered_booking_end_counts = booking_end_counts.copy()
clustered_booking_end_counts['CLUSTER'] = model.predict(booking_end_counts_scaled[['COUNT', 'HOUR_OF_DAY']])
sns.scatterplot(x='HOUR_OF_DAY', y='COUNT', data=clustered_booking_end_counts, hue='CLUSTER').set_title('Clustered end rental zone bookings')
plt.show()
Out[47]:
Text(0.5, 1.0, 'Clustered end rental zone bookings')
In [48]:
# Afternoon
sns.scatterplot(x='START_LATITUDE', y='START_LONGITUDE', data=clustered_booking_start_counts[clustered_booking_start_counts['CLUSTER'] == 3], hue='COUNT', size='COUNT')
plt.show()
sns.scatterplot(x='END_LATITUDE', y='END_LONGITUDE', data=clustered_booking_end_counts[clustered_booking_end_counts['CLUSTER'] == 3], hue='COUNT', size='COUNT')
plt.show()
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x17b395240>
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x183d36f98>

In the afternoon, many people drive from the city to the outscirts.

Clustering of Trip Types and Customers

In the following the cusotmers will be clustered by their bookings. It is possible that there is a correlation between the booking duration and the distance between the start and end rental zone. Therfore, as a first clustering task, the bookings are plotted by booking duration and distance and then clustered.

First, a copy of the dataframe cluster_data with the columns 'BOOKING_DURATION' and 'DISTANCE' is created as X.

In [49]:
# Setting X as duration and distance for clustering
X = cluster_data[['BOOKING_DURATION', 'DISTANCE']].copy()
X.dropna(axis=0, inplace=True)
X.head()
Out[49]:
BOOKING_DURATION DISTANCE
0 0.583333 0.000000
1 9.016667 0.947364
2 102.783333 0.000000
3 10.700000 1.860846
4 2.366667 0.000000

Because k-means++ build clusters with the eclueadean distance, the calculation of clusters is very sensitive to outliers. The clusters would be distorted in that case. Therefore, clusters have to be removed. But as a first step, the data has to be plotted to find out if there are outlieres.

In [50]:
# Building boxplot for outlier detection
X.boxplot(['BOOKING_DURATION'],figsize = (10,20),grid=True);

The boxplot shows that the there are many extreme outliers with a distance of more than one and a half interquartile range. Therfore, it is necessary to remove them.

For the removal of the outliers, the z-score is used.

"The Z-score method relies on the mean and standard deviation of a group of data to measure central tendency and dispersion. This is troublesome, because the mean and standard deviation are highly affected by outliers – they are not robust. In fact, the skewing that outliers bring is one of the biggest reasons for finding and removing outliers from a dataset!

The Z-score, or standard score, is a way of describing a data point in terms of its relationship to the mean and standard deviation of a group of points. Taking a Z-score is simply mapping the data onto a distribution whose mean is defined as 0 and whose standard deviation is defined as 1.

The goal of taking Z-scores is to remove the effects of the location and scale of the data, allowing different datasets to be compared directly. The intuition behind the Z-score method of outlier detection is that, once we’ve centred and rescaled the data, anything that is too far from zero (the threshold is usually a Z-score of 3 or -3) should be considered an outlier." [http://colingorrie.github.io/outlier-detection.html#z-score-method]

In [51]:
# Calculating z-score
z = np.abs(stats.zscore(X['BOOKING_DURATION']))
In [52]:
# Cleaning data for clustering
X_Cleaned = X.loc[z<3,:]
X_Cleaned.describe()
Out[52]:
BOOKING_DURATION DISTANCE
count 345367.000000 345367.000000
mean 24.855489 1.613369
std 40.717474 1.132133
min 0.016667 0.000000
25% 7.866667 0.773067
50% 12.666667 1.505922
75% 22.125000 2.293980
max 418.900000 8.688161

As seen in the description of the data above, the maximal duration is only 419 min instead of 9204 min.

Because the units of distance and duration are different, the data has to be scaled. This makes it possible to calcualte more natural clusters.

In [53]:
# Scale the data because k-means++ is sensitive to outliers
scaler = StandardScaler();
scaler.fit(X_Cleaned);
X_scaled = scaler.transform(X_Cleaned);
X_scaled_df = pd.DataFrame(X_scaled,columns=X_Cleaned.columns, index = X_Cleaned.index);
X_scaled_df.head()
Out[53]:
StandardScaler(copy=True, with_mean=True, with_std=True)
Out[53]:
BOOKING_DURATION DISTANCE
0 -0.596112 -1.425072
1 -0.388994 -0.588275
2 1.913870 -1.425072
3 -0.347652 0.218594
4 -0.552315 -1.425072

For a first impression of the data, a scatterplot is created.

In [54]:
# Scatterplot
plt.figure(figsize=(15, 10), dpi=80)
fig = plt.scatter(data=X_scaled_df, x='BOOKING_DURATION',y='DISTANCE',s=6);

On the first glance, the duration and distance does not correlate as expected. A longer booking does not mean that the distance is longer.

To get a visual estimate of the best amount of clusters, an elbow method is plotted.

In [55]:
# Plotting elbow method
plot_elbow_graph(X_scaled_df)

The plot of the elbow method suggests that 5 clusters are a good fit.

In [56]:
# Fit a k-means++ model
five_means = KMeans(n_clusters = 5, random_state=1337, init='k-means++');
five_means.fit(X_scaled);
five_means.predict(X_scaled);

For an easier interpretation of the later plot, the first scaled data is now scaled back.

In [57]:
# Scaling back for better understanding of plots
five_means_new_inverse = scaler.inverse_transform(X_scaled)
five_means_new_inverse_df = pd.DataFrame(five_means_new_inverse,columns=X_Cleaned.columns, index = X_Cleaned.index)
five_means_new_inverse_df.describe()
Out[57]:
BOOKING_DURATION DISTANCE
count 345367.000000 345367.000000
mean 24.855489 1.613369
std 40.717474 1.132133
min 0.016667 0.000000
25% 7.866667 0.773067
50% 12.666667 1.505922
75% 22.125000 2.293980
max 418.900000 8.688161
In [58]:
# Plotting the five clusters
flatui = ["#d7191c", "#fdae61", "#ffffbf", "#abd9e9", "#2c7bb6"]

numbers = ["zero", "one", "two", "three", "four", "five"]

five_means_new_inverse_df["five"] = five_means.predict(X_scaled)
five_means_new_inverse_df["five"] = five_means_new_inverse_df["five"].apply(lambda x: numbers[x])
sns.pairplot(data=five_means_new_inverse_df,height=8,hue='five',palette=sns.color_palette(flatui));

For a better visualization, the plot in the lower left corner is plotted seperately in in the following cell.

In [59]:
# Scatterplot
plt.figure(figsize=(15, 10))
sns.scatterplot(data=five_means_new_inverse_df, y='DISTANCE', x='BOOKING_DURATION', hue='five', palette=sns.color_palette(["#d7191c", "#fdae61", "#ffffbf", "#abd9e9", "#2c7bb6"]),s=8);

The bookings were divided into 5 clusters:

  • red: short to medium time of rent (ca. 0 - 50 min) and short distance of travel (ca. 0 - 1 km).
  • yellow: short to medium time of rent (ca. 0 - 50 min) and medium distance of travel (ca. 1 - 3 km).
  • light blue: short to medium time of rent (ca. 0 - 200 min) and long distance of travel (ca. 3 - 8 km).
  • orange: medium to long time of rent (ca. 50 - 190 min) and short to long distance of travel (ca. 0 - 4 km).
  • dark blue: very long time of rent (ca. 200 - 400 min) and short to long distance of travel (ca. 0 - 8 km).

The customers with a distance of zero could be round-trip driver. Trips in the dark blue cluster could be day trips where the bike is rented for a few hours. The red, yellow, and light blue trips are probably commuter trips.

It may be interesting to cluster the customers by their most frequent start rental time and their amount of bookings. A result could be that customers who rent a similar amount of time also rent at the same time (e.g. commuter, pleasure biker, etc.).

In order to do that clustering, the dateframe customer_bookings is created. It lists the hours of the start booking times of all bookings per customer ('RENTHOURS').

In [60]:
# Creating a list of hours of bookings for each customer
customer_bookings = cluster_data.groupby('CUSTOMER_HAL_ID')['HOUR_OF_DAY'].apply(list).reset_index(name='RENTHOURS')
customer_bookings.head()
Out[60]:
CUSTOMER_HAL_ID RENTHOURS
0 000522A88E977EB0C935B8AC1D520000EFE4FD8C [0]
1 000A2ACA23E1AB0C12CFD2A266C2BA1E0B2F612B [13]
2 000E30498A90738D1231FBA60F3146F6339D9B41 [12, 23, 23, 23, 22, 22, 19, 15, 10, 14, 17, 1...
3 0020456281C94A7E92A7405D7B9C7E4BB0BECC9F [14, 13, 15]
4 0035083494A8C7A7750962A7727137A4538B8F99 [8, 12, 9, 15, 9]

To get the hour of day where a customer has started the most bookings, a third column ('MODUS') is created, that calculates the modus of the list (the most occuring times).

In [61]:
# Adding MODUS coulmn with modus of hours of time of bookings per customer
customer_bookings['MODUS'] = customer_bookings['RENTHOURS'].apply(lambda x: stats.mode(np.array(x).astype(np.int))[0][0])

To get the number of bookings, a customer has done in the period of two years, the length of the list is calculated and added as the value for the column 'COUNT'.

In [62]:
# Adding COUNT column with number of rentals per customer
customer_bookings['COUNT'] = customer_bookings['RENTHOURS'].apply(lambda x: len(x))

For a first impression of the data, a scatterplot is created.

In [63]:
# Scatterplot
plt.figure(figsize=(15, 10), dpi=80)
fig = plt.scatter(data=customer_bookings, x='MODUS',y='COUNT',s=6);

For an overview of the number of customers renting most favorably at certain times, a plot is created where the customers are grouped by ther most occuring start booking time.

In [64]:
# Plotting the number of customers for their most rented time of day
sns.scatterplot(x=customer_bookings.groupby(['MODUS']).count().index, y='COUNT', data=customer_bookings.groupby(['MODUS']).count());

A seasonality of the day can be observed.

Hour of the Day

It is necessary to map each hour onto a circle such that the lowest value for that variable appears right next to the largest value, by computing the x- and y- component of that point using sin and cos trigonometric functions. This transformation will be applied to all time features in this analysis.

image.png source: Link

In [65]:
# Adding sin and cos of hour for more natural calculations
customer_bookings['HOUR_SIN'] = np.sin(customer_bookings.MODUS * (2. * np.pi / 24))
customer_bookings['HOUR_COS'] = np.cos(customer_bookings.MODUS * (2. * np.pi / 24))

Because the units of time and count are different, the data has to be scaled. This makes it possible to calcualte more natural clusters.

In [66]:
# Scale data for clustering
X2 = customer_bookings[['HOUR_SIN', 'HOUR_COS', 'COUNT']]
scaler = StandardScaler()
scaler.fit(X2)
X2_scaled = scaler.transform(X2)
X2_scaled_df = pd.DataFrame(X2_scaled,columns=X2.columns, index = X2.index)
X2_scaled_df.head()
Out[66]:
StandardScaler(copy=True, with_mean=True, with_std=True)
Out[66]:
HOUR_SIN HOUR_COS COUNT
0 0.446049 1.584620 -0.482709
1 0.026824 -1.136355 -0.482709
2 -0.699295 -0.778132 0.675588
3 0.026824 -1.136355 -0.434446
4 1.591393 -0.778132 -0.386184

To get a visual estimate of the best amount of clusters, an elbow method is plotted.

In [67]:
# Plotting elbow method
plot_elbow_graph(X2_scaled_df)

The plot of the elbow method suggests that 5 clusters are a good fit.

In [68]:
# Fit a k-means++ model with the booking start counts
four_means = KMeans(n_clusters = 4, random_state=1337, init='k-means++');
four_means.fit(X2_scaled);
four_means.predict(X2_scaled);

For an easier interpretation of the later plot, the first scaled data is now scaled back.

In [69]:
# Scaling back for better understanding of plots
X2_new_inverse = scaler.inverse_transform(X2_scaled);
X2_new_inverse_df = pd.DataFrame(X2_new_inverse,columns=X2.columns, index = X2.index);
X2_new_inverse_df['MODUS'] = customer_bookings['MODUS'];
X2_new_inverse_df.describe()
Out[69]:
HOUR_SIN HOUR_COS COUNT MODUS
count 16733.000000 16733.000000 16733.000000 16733.000000
mean -0.275380 -0.144901 21.003526 12.452698
std 0.617394 0.722530 41.441394 6.482105
min -1.000000 -1.000000 1.000000 0.000000
25% -0.866025 -0.866025 2.000000 9.000000
50% -0.500000 -0.258819 7.000000 14.000000
75% 0.258819 0.500000 21.000000 17.000000
max 1.000000 1.000000 807.000000 23.000000
In [70]:
# Plotting clusters

# Use colours that do not discriminate against colour blindness.
flatui = ["#d7191c", "#fdae61", "#ffffbf", "#abd9e9", "#2c7bb6"]

numbers = ["zero", "one", "two", "three", "four"]

X2_new_inverse_df["four"] = four_means.predict(X2_scaled);
X2_new_inverse_df["four"] = X2_new_inverse_df["four"].apply(lambda x: numbers[x]);
sns.pairplot(data=X2_new_inverse_df,height=8,hue="four",palette=sns.color_palette(flatui));

For a better visualization, the plot in the second to last row on the right is plotted seperately in in the following cell.

In [71]:
# Scatterplot
plt.figure(figsize=(15, 10))
sns.scatterplot(data=X2_new_inverse_df, y='COUNT', x='MODUS', hue='four', palette=sns.color_palette(["#d7191c", "#fdae61", "#ffffbf", "#abd9e9"]),s=50);

The customers clusters can be divided into two types:

There are three types of customers who rent between 1 and ca. 100 times:

  • red: Night owls, who rent preferably at night.
  • orange: Early birds, who rent preferably in the morning and before noon.
  • yellow: Evening driver, Those who rent preferably in the afternoon and evening.

Then there are of customers who rent between 100 and 800 times:

  • light blue: Frequent drivers, rent the most during the day (8am - 8pm). They usally do not rend in the night and at early mornings.

Inspection of the Raw Time Series

In this section the trajectory of different resolutions is plotted. It can be seen that rental demand is especially high in the middle of the year and low at the start and end. In addition the graphs show that demand is highly volatile with an increasing variance in the middle of the year.

Counts over time

In [72]:
# Creating fig and subplots
fig, axs = plt.subplots(4, 1, figsize=(17, 15), sharex=True)

# One hour resolution plot
one_h_data['COUNT'].plot(ax=axs[0])
axs[0].grid(True)
axs[0].set_title('Hourly')

# Two hour resolution plot
two_h_data['COUNT'].plot(ax=axs[1])
axs[1].grid(True)
axs[1].set_title('Two Hour Resolution')

# Six hour resolution plot
six_h_data['COUNT'].plot(ax=axs[2])
axs[2].grid(True)
axs[2].set_title('Six Hour Resolution')

# 24 hour resolution plot
twentyfour_h_data['COUNT'].plot(ax=axs[3])
axs[3].grid(True)
axs[3].set_title('Daily')
axs[3].set_xlabel('Date')

fig.text(0.5, 0.92, 'Bike Rental Demand', ha='center', size=30);

Mean shifts over time

In order to detail the above insights in the following box-plots will be used to inspect the mean shifts that occur on a (sub-)daily, weekly and monthly bases. Since the above showed seasonal differences that depend on the month of the year, the data will be enriched by the information in which season of the year it falls in, in order to filter out the impact of a specific season on (sub-)daily rentals.

Conclusions that can be drawn from the below graphs are as follows:

Hour of the Day:
  • strong indication for sub-daily seasonality
    • demand is very low during night (2 pm - 6 am) and very high from (3 pm - 20 pm)
    • this pattern does not change in respect to the season
  • during spring and summer the demand is highly volatile
  • mean demand is highest in summer followed by spring, fall and winter
Day of Week
  • indication of intra-weekly seasonality
    • cyclical pattern during summer month with frequency increasing from start of the week to thursday and then decreasing
    • pattern is not clear/observable during the other seasons
Week of the Year and Month of the Year
  • indication for intra-yearly seasonality
    • affirmation of the above findings i.e. increasing mean bike rental demand and variance towards middle of the year
In [73]:
# Creating individual HOUR, DAY_OF_WEEK, DAY_OF_MONTH, WEEK_OF_YEAR and MONTH_OF_YEAR columns
one_h_data['HOUR'] = one_h_data.index.strftime('%H')
one_h_data['DAY_OF_WEEK'] = one_h_data.index.strftime('%a')
one_h_data['DAY_OF_MONTH'] = one_h_data.index.strftime('%d')
one_h_data['WEEK_OF_YEAR'] = one_h_data.index.strftime('%W')
one_h_data['MONTH_OF_YEAR'] = one_h_data.index.strftime('%m')

# Creating a specific cat type to allow logical ordering of DAY_OF_WEEK
cat_type = pd.api.types.CategoricalDtype(categories=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], ordered=True)
one_h_data['DAY_OF_WEEK'] = one_h_data['DAY_OF_WEEK'].astype(cat_type)

# Creating a SEASON column based on the specific month
seasons = []
for month in one_h_data.index.strftime('%m').astype('int'):
    if month in [1, 2, 12]:
        seasons.append('WINTER')
    elif month in [3, 4, 5]:
        seasons.append('SPRING')
    elif month in [6, 7, 8, 9]:
        seasons.append('SUMMER')
    elif month in [10, 11]:
        seasons.append('FALL')
one_h_data['SEASON'] = seasons

# Creating fig and subplots
fig = plt.figure(figsize=(17, 30))
ax1 = fig.add_subplot(511)
ax2 = fig.add_subplot(512)
ax3 = fig.add_subplot(513)
ax4 = fig.add_subplot(514)
ax5 = fig.add_subplot(515)

# Plotting hour of the day boxplots
hue_order = ['SPRING', 'SUMMER', 'FALL', 'WINTER']
sns.boxplot(x='HOUR', y='COUNT', hue='SEASON', hue_order=hue_order, data=one_h_data, ax=ax1)
ax1.set_title('Hour of the Day')
ax1.set_xlabel('')
ax1.grid(True)

# Plotting day of the week boxplots
sns.boxplot(x='DAY_OF_WEEK', y='COUNT', hue='SEASON', hue_order=hue_order, data=one_h_data, ax=ax2).legend_.remove()
ax2.set_title('Day of the Week')
ax2.set_xlabel('')
ax2.grid(True)

# Plotting day of the month boxplots
sns.boxplot(x='DAY_OF_MONTH', y='COUNT', hue='SEASON', hue_order=hue_order, data=one_h_data, ax=ax3).legend_.remove()
ax3.set_title('Day of the Month')
ax3.set_xlabel('')
ax3.grid(True)

# Plotting week of the year boxplots
sns.boxplot(x='WEEK_OF_YEAR', y='COUNT', data=one_h_data, ax=ax4)
ax4.set_title('Week of the Year')
ax4.set_xlabel('')
ax4.grid(True)

# Plotting month of the year boxplots
sns.boxplot(x='MONTH_OF_YEAR', y='COUNT', data=one_h_data, ax=ax5)
ax5.set_title('Month of the Year')
ax5.set_xlabel('')
ax5.grid(True)

fig.text(0.5, 0.92, 'Seasonalities in Bike Rental Demand', ha='center', size=30);

Predictive Analysis

In the following chapter predictive analysis will be conducted for each resolution. First a feature examination and feature engineering will be carried out. Second appropriate algorithms will be trained and their performance evaluated using rolling cross validation.

Daily Data

The task of daily bike rental demand is very important. It can help to increase service quality since the re-distribution of bikes can be planned roughly for the following day. It enables the company to allocate the necessary amount of ressources (e.g. service-workers), which is difficult on a sub-daily basis. It therefore depicts a central part of mid-term planning.

Feature Examination

Numerical Features

In addition to the descriptive analysis which showed that the trajectory itself already contains information in regarding to its realization a specific time or time frame, other useful features have been added during data preparation. The following graphs show the correlation between variables, represented by a fitted linear regression line, on the upper diagonal, the frequency on the main diagonal and the conditional distribution of each variable in respect to the other on the lower diagonal.

As one could imagine the graphs show that sunshine duration as well as temperature are positively correlated to the number of bookings, which seems reasonable since people should be more likely to ride a bike when it is dry and warm outside. Not surprisingly do humidity, the precipitation height as well as wind speed exert a negative impact on the number of bookings. Precipitation height however does this with a lot of uncertainty which can be explained by the fact that zero is the most frequent precipitation height leaving relatively few observations to observe the impact on bike rental demand.

All in all it can be seen, that any numerical feature exerts an influence on the target variable. Therefore all of these features will be included during prediction and the above assumptions will be re-evaluated.

In [74]:
# Creating a subset of only numerical features
numerical_features = twentyfour_h_data[['COUNT', 'SUNSHINE_DURATION', 'AIR_TEMP', 'HUMIDITY',
                                        'PRECIPITATION_HEIGHT', 'MEAN_WIND_SPEED']]

# Initializing and plotting a pair grid with regression plots on the upper, hist plots on the main and kde plots 
# on the lower diagonal
g = sns.PairGrid(numerical_features)
g = g.map_upper(sns.regplot, scatter_kws={'alpha':0.7}, line_kws={'color': 'red'})
g = g.map_diag(plt.hist)
g = g.map_lower(sns.kdeplot, legend=False);

Binary Features

The box-plots below visualize the difference in mean within the range of the binary features. It can be seen that all three binary features differ within its range. Therefore all three variables will be used in prediction.

In [75]:
# Creating fig and subplots
fig, ax = plt.subplots(1, 3, figsize=(20, 15), sharey=True)

# Plotting rain fallen boxplots
sns.boxplot(x='RAIN_FALLEN', y='COUNT', data=twentyfour_h_data, ax=ax[0])
ax[0].set_title('Rain Fallen')
ax[0].set_xlabel('')

# Plotting bank holiday boxplots
sns.boxplot(x='BANK_HOLIDAY', y='COUNT', data=twentyfour_h_data, ax=ax[1])
ax[1].set_title('Bank Holiday')
ax[1].set_xlabel('')
ax[1].set_ylabel('')

# Plotting school holiday boxplots
sns.boxplot(x='SCHOOL_HOLIDAY', y='COUNT', data=twentyfour_h_data, ax=ax[2])
ax[2].set_title('School Holiday')
ax[2].set_xlabel('')
ax[2].set_ylabel('');

Autocorrelation

The descriptive analysis has shown that the number of counts varied over time in a seasonal manner. Since the plots showed an almost cyclical pattern the following plots show the correlation between the actual values and its lags. It becomes clear that there is autocorrelation in this data which allows the usage of traditional time series models for prediction.

In [76]:
# Creating fig and subplots
fig, axes = plt.subplots(2, 5, figsize=(20, 7), sharex=True, sharey=True, dpi=500)

# Plotting the relationship between the actual count and its lagged values 
for i, ax in enumerate(axes.flatten()[:10]):
    pd.plotting.lag_plot(twentyfour_h_data['COUNT'], lag=i + 1, ax=ax)
    ax.set_title('Lag ' + str(i + 1))

plt.tight_layout();

Prediction

The underlying daily data contains some seasonal and time dependent factors as well as some correlation with the chosen exogenous features. Since in this section of the notebook it is dealt with data on a daily resolution time-series algorithms can be used as daily resolution is the smallest acceptable granularity for most of these algorithms.

Therefore in the following two algorithms (SARIMAX and PROPHET) will be fit and evaluated in regards to their fitting abilities (explanatory power) and their predictive performance based on a rolling one-step-ahead forecast.

Seasonal Autoregressive Integrated Moving Average with exogenous regressors (SARIMAX)

The SARIMAX model is appropriate for this prediction as the data contains seasonal characteristics, autocorrelation and correlation between exogenous regressors. Less complex or more efficient models like ARIMA or ARMA do not allow for the use of exogenous regressors and do not account for seasonality, which makes them inappropriate for this matter.

To fit a good model one must find suitable parameters. The plots below show the autocorrelation as well as the partial autocorrelation function of the differenced time series.

The ACF shows a first peak crossing the lower confidence interval at lag one indicating a non seasonal MA parameter of order one. It also shows recurring peaks every eighth lag indicating a seasonal MA parameter of order 8 at period for the eighth period.

The PACF shows a first peak crossing the lower confidence interval at lag one indicating a non seasonal AR parameter of order one. It however shows no seasonal pattern indicating a seasonal AR parameter of order 0.

In [77]:
# Creating fig and subplots
fig = plt.figure(figsize=(20, 10))
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)

# Plotting the autocorrelation of the actual count
fig = plot_acf(twentyfour_h_data['COUNT'].diff().dropna(axis=0),lags=50, ax=ax1)

# Plotting the partial-autocorrelation of the actual count
fig = plot_pacf(twentyfour_h_data['COUNT'].diff().dropna(axis=0),lags=50, ax=ax2)
In [78]:
# Initiating the order and oconfig of the sarimax model
sarimax_model = SARIMAX(endog=twentyfour_h_data['COUNT'], exog=twentyfour_h_data.drop(columns=['COUNT'], axis=1),
                        order=(1, 1, 1), seasonal_order=(0, 0, 1, 8), freq='D')
In [79]:
%%time

# Fitting the sarimax model
sarimax_result = sarimax_model.fit(maxiter=1000, disp=False, method='lbfgs')
CPU times: user 30.1 s, sys: 4.3 s, total: 34.4 s
Wall time: 8.74 s

The below summary shows that the coefficients of almost all exogenous variables are significant at the one percent level. Only the variable feature is insignificant on all common significant levels. In addition the coefficients seem to be significant indicating a reasonable choice of seasonal and non-seasonal orders.

Additionally, the coefficient signs seem reasonable and go in line with the findings during feature examination. An example for this can be the high positive influence of air temperature or the negative influence of school holiday.

In [80]:
# Printing the summary of the fitted model
sarimax_result.summary()
Out[80]:
SARIMAX Results
Dep. Variable: COUNT No. Observations: 731
Model: SARIMAX(1, 1, 1)x(0, 0, 1, 8) Log Likelihood -4185.772
Date: Mon, 27 Jan 2020 AIC 8395.545
Time: 22:59:30 BIC 8450.661
Sample: 01-01-2015 HQIC 8416.809
- 12-31-2016
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
SUNSHINE_DURATION 0.1393 0.019 7.330 0.000 0.102 0.177
AIR_TEMP 14.8868 1.206 12.343 0.000 12.523 17.251
HUMIDITY -4.0040 0.535 -7.481 0.000 -5.053 -2.955
PRECIPITATION_HEIGHT -9.1146 0.587 -15.531 0.000 -10.265 -7.964
RAIN_FALLEN -20.0995 7.115 -2.825 0.005 -34.044 -6.155
MEAN_WIND_SPEED -20.9263 3.674 -5.696 0.000 -28.127 -13.725
BANK_HOLIDAY -3.6239 13.763 -0.263 0.792 -30.600 23.352
SCHOOL_HOLIDAY -42.9505 16.073 -2.672 0.008 -74.453 -11.448
ar.L1 0.5225 0.036 14.424 0.000 0.452 0.594
ma.L1 -0.9259 0.017 -53.314 0.000 -0.960 -0.892
ma.S.L8 0.1753 0.034 5.093 0.000 0.108 0.243
sigma2 5594.1750 265.954 21.034 0.000 5072.914 6115.436
Ljung-Box (Q): 346.56 Jarque-Bera (JB): 13.67
Prob(Q): 0.00 Prob(JB): 0.00
Heteroskedasticity (H): 1.06 Skew: 0.02
Prob(H) (two-sided): 0.67 Kurtosis: 3.67


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
In [81]:
# Plotting diagnostical residual plots
fig = sarimax_result.plot_diagnostics(figsize=(17,10));

The above plots indicate that the model definition has been done reasonably well since the resulting residuals show an approximately normal distribution which is one of the key assumptions of SARIMA models. In addition most of the correlation has been removed by the model. However there is still some correlation left. This indicates that the current model does not perfectly depict the true model. To find the true model a grid search for hyper-parameters should be conducted since the determination of the true parameters gets difficult if seasonal as well as non-seasonal autoregressive and moving average characteristics are present.

In [82]:
# Plotting the fitted model
pd.DataFrame({'ACTUALS': twentyfour_h_data['COUNT'], 'PREDICTIONS': sarimax_result.fittedvalues}).plot(figsize=(20, 10));

The above graph shows the fitted values of the model which are a result of in-sample predictions. It can be seen that the model fits the data relatively well though it seems to have problems to predict comparatively high or low bike rental demand. This however is not surprising as the trajectory is highly volatile which may make it difficult for a relatively simple model representation such as SARIMAX. This however does not mean that the model has a low predictive performance. In order to evaluate this perfomance on unseen data has to be examined. Since there are other models that can deal with multiple seasonality and high volatility, SARIMAX will be seen as a baseline for the prediction of daily data in the following analysis, and will be benchmarked after competing model has been trained.

Prophet

Facebook developed a time series algorithm that fits a model that decomposes the series into three main model components: trend, seasonality and holidays. In addition it allows for external regressors to be included.

The main benefits of the PROPHET model are as follows:

  • fewer assumptions about the underlying process
  • faster fitting
  • allows for multiple seasonal changepoints and seasonalities and is therefore suitable for highly volatile data
  • determines most of the parameters automatically but allows for modification at the same time

For further information please consult https://peerj.com/preprints/3190.pdf

Therefore in the following a PROPHET model will be fit with adjusted hyperparameters accounting for the high volatility of the data.

In [83]:
# Creating a copy of twentyfour_h_data so that prophet can interpret input
prophet_df = twentyfour_h_data.copy()
prophet_df = prophet_df.reset_index().rename(columns={'DATE_FROM': 'ds', 'COUNT': 'y'})

# Creating a holiday dataframe so prophet can interpret them correctly
BANK_HOLIDAYS = pd.DataFrame({
    'holiday': 'BANK_HOLIDAYS',
    'ds': pd.to_datetime(prophet_df.loc[prophet_df.BANK_HOLIDAY == 1, 'ds']),
    'lower_window': 0,
    'upper_window': 1,
})

SCHOOL_HOLIDAYS = pd.DataFrame({
    'holiday': 'SCHOOL_HOLIDAYS',
    'ds': pd.to_datetime(prophet_df.loc[prophet_df.SCHOOL_HOLIDAY == 1, 'ds']),
    'lower_window': 0,
    'upper_window': 1,
})
holidays = pd.concat((BANK_HOLIDAYS, SCHOOL_HOLIDAYS))

# Dropping columns BANK_HOLIDAY and SCHOOL_HOLIDAY
prophet_df.drop(columns=['BANK_HOLIDAY', 'SCHOOL_HOLIDAY'], axis=1, inplace=True)

# Initialising the prophet model 
prophet_model = Prophet(holidays=holidays, seasonality_mode='additive', changepoint_prior_scale=0.5, 
                        changepoint_range=0.9, uncertainty_samples=100)

# Adding exogenous regressors to the initialised model
prophet_model.add_regressor('SUNSHINE_DURATION')
prophet_model.add_regressor('AIR_TEMP')
prophet_model.add_regressor('PRECIPITATION_HEIGHT')
prophet_model.add_regressor('MEAN_WIND_SPEED')
prophet_model.add_regressor('HUMIDITY')
prophet_model.add_regressor('RAIN_FALLEN');
In [84]:
%%time

# Fitting the model
prophet_model.fit(prophet_df);
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
CPU times: user 622 ms, sys: 10.2 ms, total: 633 ms
Wall time: 642 ms
Out[84]:
<fbprophet.forecaster.Prophet at 0x1cdfe7828>

Looking at the time it could already be seen that PROPHET fits very fast. It is about 30x faster than the SARIMAX model in regards to training time.

In [85]:
# Forecasting the fitted data for seasonal and residual plots
forecast = prophet_model.predict(prophet_df);
In [86]:
# Plotting the model components
fig = prophet_model.plot_components(forecast, weekly_start=1);

The above graph shows trend, seasonal patterns as well as the influence of the exogenous regressors and holidays.

Trend

It seems like the model has not detected the correct trend in the data since the curve looks more like a seasonal pattern.

Holidays

Prophet has also recognized the negative impact of school and business holidays it attributes an impact of sometimes 150 bikes less during a period of holidays.

Weekly Seasonality

The algorithm has detected the same cyclical nature as found in the descriptive analysis. Bike rental demand increases from the start of the week until thursday and then decreases again to the end of the week.

Yearly Seasonality

Overall it can be said that the algorithm detects the yearly seasonal patter in a reasonable way. On second thought however some of the illustrated patterns seem suspicious. Namely these consist of the dip in april which usually shows an increase through better weather conditions and the increase in november which cannot be observed looking at the raw time series. This can either be related to the not perfectly defined trend, or is true after the effects of other variables have been filtered out.

External Regressors

This graph is difficult to interpret since the impacts of all regressors are cumulated therefore the influence of each individual regressor is unclear. However it looks like the influence changes almost seasonally just like climate data is expected to.

In [87]:
# Creating fig and subplots
fig, axs = plt.subplots(2, 1, figsize=(17, 10), sharex=True)

# Plotting the fitted model
pd.DataFrame({'Actuals': prophet_df['y'], 'Predictions': forecast['yhat']}).plot(ax=axs[0])

# Plotting the residuals 
pd.DataFrame({'Residuals': prophet_df['y'] - forecast['yhat']}).plot(ax=axs[1]);

The above graphs show the fit of the model as well as the residuals. It seems that PROPHET performed not as good as the SARIMAX model in terms of fit to the data. However since the aim is to get good predictive accuracy on unseen data the models have to be cross validated.

Evaluation

In the following a rolling one step ahead cross validation will be conducted. For this purpose the models will be trained using data until a specific point in time t forecasting t + 1. This step will be repeated 41 times taking a nine period step into the future until there is no data left.

In [88]:
# Setting the period parameter
period = 9

# Setting the horizon parameter
horizon = 1
In [89]:
%%time

# Creating a list to store the prediction results
sarimax_predictions = []

# Looping throug the dataset creating a onestep-ahead prediction
for date in twentyfour_h_data.loc['2016-01-05'::period, :].index:
    
    # Setting the date to forecast
    fc_date = date + relativedelta(days=horizon)
    
    # Creating test and train splits
    sarimax_train_endog = twentyfour_h_data.loc['2015-01-01':date, 'COUNT']
    sarimax_train_exog = twentyfour_h_data.loc['2015-01-01':date, 'SUNSHINE_DURATION':]
    sarimax_test_exog = twentyfour_h_data.loc[fc_date:fc_date, 'SUNSHINE_DURATION':]

    # Initialising the order and parameters
    sarimax_model = SARIMAX(endog=sarimax_train_endog, exog=sarimax_train_exog, order=(1, 0, 1),
                            seasonal_order=(0, 0, 1, 8), freq='D', simple_differencing=True)
    
    # Fitting the model to training data
    sarimax_osh_model = sarimax_model.fit(maxiter=1000, disp=False, method='lbfgs')
    
    # One-step-ahead prediction
    predicted_values = sarimax_osh_model.predict(start=fc_date, exog=sarimax_test_exog)

    # Appending the prediction to the list
    sarimax_predictions.append(predicted_values.values[0])
CPU times: user 26min 21s, sys: 4min 17s, total: 30min 39s
Wall time: 8min 13s
In [90]:
%%time

# Starting a one-step-ahead prediction
prophet_cv = cross_validation(prophet_model, 
                              initial='365.25 days', 
                              period='{} days'.format(period), 
                              horizon='{} days'.format(horizon))
INFO:fbprophet:Making 41 forecasts with cutoffs between 2016-01-05 00:00:00 and 2016-12-30 00:00:00
CPU times: user 25.1 s, sys: 286 ms, total: 25.4 s
Wall time: 25.5 s
In [91]:
# Creating a dataframe containig the actuals and predictions of sarimax and prophet model 
daily_evaluation = pd.DataFrame({'Actuals': prophet_cv['y'],
                                 'Prophet': prophet_cv['yhat'],
                                 'SARIMAX': sarimax_predictions}).set_index(prophet_cv['ds'])

# Calculating the residuals of both models
sarimax_residuals = prophet_cv['y'].values - sarimax_predictions
prophet_residuals = prophet_cv['y'].values - prophet_cv['yhat'].values

# Creating a dataframe containing the residuals of both models
daily_residuals = pd.DataFrame({'Prophet': prophet_residuals,
                                'Sarimax': sarimax_residuals}).set_index(prophet_cv['ds'])

# Creating fig and subplots
fix, axs = plt.subplots(2, 1, figsize=(17, 10), sharex=True)

# Plotting the actuals and predictions
daily_evaluation.plot(ax=axs[0])

# Plotting the residuals
daily_residuals.plot(ax=axs[1]);
In [92]:
# Printing the mae of both models
print('Mean Absolut Error of Prohet: {0:.2f}\n'.format(performance_metrics(prophet_cv).iloc[0, 3]))

print('Mean Absolut Error of Sarimax: {0:.2f}'.format(abs(sarimax_residuals).mean()))
Mean Absolut Error of Prohet: 55.37

Mean Absolut Error of Sarimax: 63.69
In [93]:
# Printing the mae of both models
print('Mean Absolut Error of Prohet: {0:.2f}\n'.format(performance_metrics(prophet_cv).iloc[0, 3] / twentyfour_h_data['COUNT'].mean()))

print('Mean Absolut Error of Sarimax: {0:.2f}'.format(abs(sarimax_residuals).mean() / twentyfour_h_data['COUNT'].mean()))
Mean Absolut Error of Prohet: 0.12

Mean Absolut Error of Sarimax: 0.13

Looking at the above it gets clear that PROPHET is a better model for the underlying data. Apart from the fact that the PROPHET algorithm fits considerably faster it is also better at predicting the bike rental demand which can be seen comparing the mean absolute error of both models.On average Prophet eight bikes closer to the actual demand than SARIMAX is. Even though this is only a small predictive improvement, Prophet is around 40 times more efficient in terms of computational time.

Sub-Daily Data

The following section of this notebook deals with the prediction of sub-daily data. Even though PROPHET can be used for sub-daily data, it performed poorly and is therefore not considered for the prediction in this section.

The following steps will be carried out for each of the re-sampled resolutions:

  1. feature engineering to derive relevant predictors.
  2. model evaluation using rolling one-step-ahead cross validation

Using hourly data a more detailed comparison will be conducted between a baseline model (linear regression), random forest regression with default hyperparameters and a random forest regression with optimized hyperparameters. In this this the suitability of the models will be assessed and a ramdomized grid search for hyperparameters will be carried out. These hyperparameters will also be used for the other resolutions. Since the feature examination process is equals for all resolutions only the process for hourly data will be illustrated.

Hourly Data

Feature Engineering

Numerical Features

Similar to the daily data features in the hourly data show reasonable correlations. The main differences are that the correlation between counts and the sunshine duration seems to be less strong and that the wind speed is positively correlated with bike rental demand. Also the the shape of correlation between air temp and the bike rental demand seems to be not linear.

Since all of the variables seem to have an effect they will be included as predictors.

In [94]:
# Creating a subset of only numerical features
numerical_features = one_h_data[['COUNT', 'SUNSHINE_DURATION', 'AIR_TEMP', 'HUMIDITY',
                                 'PRECIPITATION_HEIGHT', 'MEAN_WIND_SPEED']]

# Initializing and plotting a pair grid with regression plots on the upper, hist plots on the main and scatter
# plots on the lower diagonal
g = sns.PairGrid(numerical_features)
g = g.map_upper(sns.regplot, scatter_kws={'alpha':0.7}, line_kws={'color': 'red'})
g = g.map_diag(plt.hist)
g = g.map_lower(plt.scatter);
Categorical Features

From the below boxplots it can be seen that the means in terms of the binary variable school holiday are indifferent. This feature will therefore not be used during analysis.

Bank holiday and rain fallen show a difference in the mean and will be included.

In [95]:
# Creating fig and subplots
fig, ax = plt.subplots(1, 3, figsize=(20, 15), sharey=True)

# Plotting rain fallen boxplots
sns.boxplot(x='RAIN_FALLEN', y='COUNT', data=one_h_data, ax=ax[0])
ax[0].set_title('Rain Fallen')
ax[0].set_xlabel('')

# Plotting bank holiday boxplots
sns.boxplot(x='BANK_HOLIDAY', y='COUNT', data=one_h_data, ax=ax[1])
ax[1].set_title('Bank Holiday')
ax[1].set_xlabel('')
ax[1].set_ylabel('')

# Plotting school holiday boxplots
sns.boxplot(x='SCHOOL_HOLIDAY', y='COUNT', data=one_h_data, ax=ax[2])
ax[2].set_title('School Holiday')
ax[2].set_xlabel('')
ax[2].set_ylabel('');
Autocorrelation

Looking at the autocorrelation it can be seen that there is significant autocorrelation at the first lag. Correlation vanishes significantly after 5 lags. Therefore 5 lags will be included.

In [96]:
# Creating fig and subplots
fig, axes = plt.subplots(2, 5, figsize=(20, 7), sharex=True, sharey=True, dpi=500)

# Plotting the relationship between the actual count and its lagged values
for i, ax in enumerate(axes.flatten()[:10]):
    pd.plotting.lag_plot(one_h_data['COUNT'], lag=i + 1, ax=ax)
    ax.set_title('Lag ' + str(i + 1))
plt.tight_layout();
In [97]:
# Dropping columns that do not have to seem an effect
one_h_data.drop(columns=['SCHOOL_HOLIDAY', 'BANK_HOLIDAY', 'HOUR', 'DAY_OF_WEEK', 'DAY_OF_MONTH',
                         'WEEK_OF_YEAR', 'MONTH_OF_YEAR', 'SEASON'], axis=1, inplace=True)
Generating Lags of order 5
In [98]:
# Generating columns that contain lagged values of the targed variable
one_h_data['lag_1'] = one_h_data['COUNT'].shift(periods=1)
one_h_data['lag_2'] = one_h_data['COUNT'].shift(periods=2)
one_h_data['lag_3'] = one_h_data['COUNT'].shift(periods=3)
one_h_data['lag_4'] = one_h_data['COUNT'].shift(periods=4)
one_h_data['lag_5'] = one_h_data['COUNT'].shift(periods=5)

# Dropping missing values as a result of the shifts and reset the index
one_h_data = one_h_data.dropna(axis=0).reset_index()
Seasons

In discriptive analysis it was already shown that means vary by seasons therefore seasons will be included as one-hot-encoded binary variables.

In [99]:
# Creating a SEASON list based on the specific month
seasons = []
for month in one_h_data.DATE_FROM.dt.strftime('%m').astype('int'):
    if month in [1, 2, 12]:
        seasons.append('WINTER')
    elif month in [3, 4]:
        seasons.append('SPRING')
    elif month in [5, 6, 7, 8, 9]:
        seasons.append('SUMMER')
    elif month in [10, 11]:
        seasons.append('FALL')
        
# Merging the seasons as one-hot-encoded binary variables
one_h_data = one_h_data.merge(pd.get_dummies(seasons, drop_first=False), how='left', left_index=True, right_index=True)
Hour of the Day

Descriptive analysis has shown that the hour of the day varies in means, therefore the hour of the day seems to depict a suitable feature for prediction. Since times do have a cyclical nature (i.e. 11pm is as close to 0 am as 1am) we tranform the hour of the day using the sine and cosine transformation. In this we map each the hour onto a circle such that the lowest value for that variable appears right next to the largest value, by computing the x- and y- component of that point using sin and cos trigonometric functions. This transformation will be applied to all time features in this analysis.

image.png source: Link

In [100]:
# Creating an individual column for hour of the day
one_h_data['HOUR'] = one_h_data.DATE_FROM.dt.strftime('%-H').astype('int')

# Applying sine,cosine transformation on column hour to retain the cyclical nature
one_h_data['HOUR_SIN'] = np.sin(one_h_data.HOUR * (2. * np.pi / 24))
one_h_data['HOUR_COS'] = np.cos(one_h_data.HOUR * (2. * np.pi / 24))
Week of Year
In [101]:
# Creating an individual column for week of the year
one_h_data['WEEK_OF_YEAR'] = one_h_data.DATE_FROM.dt.strftime('%W').astype('int')

# Applying sine,cosine transformation on column WEEK_OF_YEAR to retain the cyclical nature
one_h_data['WEEK_OF_YEAR_SIN'] = np.sin(one_h_data.WEEK_OF_YEAR * (2. * np.pi / 52))
one_h_data['WEEK_OF_YEAR_COS'] = np.cos(one_h_data.WEEK_OF_YEAR * (2. * np.pi / 52))
Day of Week
In [102]:
# Creating an individual column for day of the week
one_h_data['DAY_OF_WEEK'] = one_h_data.DATE_FROM.dt.strftime('%w').astype('int')

# Applying sine,cosine transformation on column DAY_OF_WEEK to retain the cyclical nature
one_h_data['DAY_OF_WEEK_SIN'] = np.sin(one_h_data.DAY_OF_WEEK * (2. * np.pi / 7))
one_h_data['DAY_OF_WEEK_COS'] = np.cos(one_h_data.DAY_OF_WEEK * (2. * np.pi / 7))
In [103]:
# Dropping individual time columns since their transformation will be used
one_h_data.drop(columns=['WEEK_OF_YEAR', 'DAY_OF_WEEK', 'HOUR'], axis=1, inplace=True)

Prediction

The below code defines a function that is used for cross validation. The random cross validation is used to optimize the hyperparamters of the random forest regression. The functionality is described according to PEP8 guidelines.

In [104]:
def TimeSeriesCV(data, params=None, grid_search=False, no_of_forecasts=104, horizon=1, seed=1, func=None,
                 date_col='DATE_FROM', dep_col='COUNT', cv_method='random', initial=None, period=None):
    """
    Function that runs either a rolling or random cross validation. 
    Rolling cross validation fits the model on the specified initial date range and predicts the horizon 
    then refits on the initial plus the specified period and predicts the hroizon. This is repeated until 
    there is no new data left to refit on.
    
    Random cross validation divides the data set in half and selects unique time points as cutoffs.The model
    is then fit on the data points until this cutoff and predicts the horizon. This procedure stops 
    until no cutoff is left. It therefore proceeds similar to a rolling forecast however choosing the cutoffs
    randomly.
    
    If random cross-validation is chosen grid search for optimal hyperparameters with sklearns' random forest 
    regression can be run.
    
    :param data: a dataframe containing all features and data
    :param params: dict containing hyperparameters for sklearns' random forest regression
    :param grid_search: boolean indication whether function is used for grid search
    :param no_of_forecasts: int defining how many cutoffs should be chosen i.e.the number of models to fit
    :param horizon: int defining the amount of periods to forecast ahead
    :param seed: int defining the random state for cutoff selection
    :param func: a model object with hyperparameters set; will be ignored when grid_search is set to true
    :param date_col: string specifying the column containing the date
    :param dep_col: string specifiying the column containing the target value
    :param cv_method: string specifying the cross validation method either 'rolling' or 'random'
    :param initial: int specifying the amount of initial data points used for training in cv method rolling
    :param period: the amount of additional data points to include in the next forecast step or rolling cv
    :return: if grid_search set to true: dict containing the hyperparameters and the corresponding mae 
    :return: if grid_search set to false: a dataframe containing predictions, actuals and corresponding dates
             and a list object containing the mae of predictions for each horizon
    """
    
    # Saving the actual time
    start_time = time.time()
    
    # Checking for method
    if cv_method == 'random':

        # If seed is specified then set the seed
        if seed:
            np.random.seed(seed)
        
        # Selecting random cutoff points 
        cutoffs = np.random.permutation(np.arange(len(data) / 2, len(data)))[:no_of_forecasts]
        
        # Initialising an array of zeors with length horizon
        residuals = np.zeros(horizon)
        
        # Initialising a dataframe to append the results to
        result_frame = pd.DataFrame(
            {'CUTOFF': [], 'PREDICTED_PERIOD': [], 'STEP_AHEAD': [], 'ACTUAL': [], 'PREDICTION': []})

        # Looping through the cutoffs predicting the cutoff + horizon
        for cut in cutoffs:
            cut = int(cut)
            
            # Train test split
            train_data = data[:cut]
            test_data = data[len(train_data):len(train_data) + horizon]

            train_data_y = train_data[dep_col]
            test_data_y = test_data[dep_col]

            train_data_X = train_data.drop(columns=[dep_col, date_col], axis=1)
            test_data_X = test_data.drop(columns=[dep_col, date_col], axis=1)

            # If not grid search use the specified function otherwise run random forest regression
            # with specified hyperparameters
            if not grid_search:
                
                # Fit the model on training data
                reg = func.fit(train_data_X, train_data_y)

                # Predict the test data
                predictions = reg.predict(test_data_X)

            else:
                # Initialising the random forest regression model object 
                func = RandomForestRegressor(n_estimators=params['n_estimators'],
                                             max_features=params['max_features'],
                                             max_depth=params['max_depth'],
                                             min_samples_split=params['min_samples_split'],
                                             min_samples_leaf=params['min_samples_leaf'],
                                             bootstrap=params['bootstrap'],
                                             random_state=123, n_jobs=-1)
                
                # Fit the model on training data
                reg = func.fit(train_data_X, train_data_y)

                # Predict the test data
                predictions = reg.predict(test_data_X)

            # Looping through the preidictions and adding the absolut error in respect horizon
            # and appending the actual predictions to the dataframe
            for counter, value in enumerate(predictions):
                residuals[counter] += abs(value - test_data_y.iloc[counter])
                result_frame = pd.concat([result_frame,
                                          pd.DataFrame({'CUTOFF': [train_data[date_col].iloc[-1]],
                                                        'PREDICTED_PERIOD': [test_data[date_col].iloc[counter]],
                                                        'STEP_AHEAD': [counter + 1],
                                                        'ACTUAL': [test_data_y.iloc[counter]],
                                                        'PREDICTION': [value]})])
                
        # Calculating the mean absolut error with respect to the horizon dividing each absolut error of 
        # each horizon by the number of forecasts
        residuals = residuals / no_of_forecasts
        
        # If not grid search return dataframe and residuals otherwise return a dict
        if not grid_search:
            print('Finished RandomCV in {} seconds'.format(round(time.time() - start_time)))
            return result_frame, residuals
        else:
            print('Finished Regressor with: {} in {} seconds and an accuracy of {}'.format(params, round(
                time.time() - start_time), residuals.mean()))
            return {residuals.mean(): params}

    elif cv_method == 'rolling':
        
        # Initialising an array of zeors with length horizon
        residuals = np.zeros(horizon)
        
        # Initialising a dataframe to append the results to
        result_frame = pd.DataFrame(
            {'CUTOFF': [], 'PREDICTED_PERIOD': [], 'STEP_AHEAD': [], 'ACTUAL': [], 'PREDICTION': []})
        
        # Looping through the data starting with the initial specified and creating forecasts of length
        # horizon after each specified period
        for i in data.iloc[initial::period, :].index:
            cut = int(i)
            
            # Train test split 
            train_data = data[:cut]
            test_data = data[len(train_data):len(train_data) + horizon]

            train_data_y = train_data[dep_col]
            test_data_y = test_data[dep_col]

            train_data_X = train_data.drop(columns=[dep_col, date_col], axis=1)
            test_data_X = test_data.drop(columns=[dep_col, date_col], axis=1)
            
            # Fit the model on training data
            reg = func.fit(train_data_X, train_data_y)

            # Predict the test data
            predictions = reg.predict(test_data_X)
            
            # Looping through the preidictions and adding the absolut error in respect horizon
            # and appending the actual predictions to the dataframe
            for counter, value in enumerate(predictions):
                residuals[counter] += abs(value - test_data_y.iloc[counter])
                result_frame = pd.concat([result_frame,
                                          pd.DataFrame({'CUTOFF': [train_data[date_col].iloc[-1]],
                                                        'PREDICTED_PERIOD': [test_data[date_col].iloc[counter]],
                                                        'STEP_AHEAD': [counter + 1],
                                                        'ACTUAL': [test_data_y.iloc[counter]],
                                                        'PREDICTION': [value]})])

        # Calculating the mean absolut error with respect to the horizon dividing each absolut error of 
        # each horizon by the number of forecasts
        residuals = residuals / len(data.iloc[initial::period, :].index)
        
        print('Finished RollingCV in with {} individual forecasts in {} seconds using method {}'.format(
            len(data.iloc[initial::period, :].index), round(time.time() - start_time), func))
        return result_frame, residuals
    else:
        print('Invalid cv_method')
Linear Regression as Baseline

In the following a simple linear regression model will be fit. Linear regression can be suitable since it is easy to interpret and solves many machine learning problems in a time efficient manner.

In [105]:
# Splitting data into targed and feature data
X = one_h_data.drop(columns=['DATE_FROM', 'COUNT'], axis=1)
y = one_h_data['COUNT']
In [106]:
# Initialising the linear regression model
lr_model = LinearRegression(n_jobs=-1)
In [107]:
# Fitting the model to the whole dataset
lr_model.fit(X, y);
In [108]:
# Calculating the fit of the model
lr_predictions = lr_model.predict(X)

# Calculating the residuals
lr_residuals = y - lr_predictions
In [109]:
# Plotting the fit of the model
pd.DataFrame({'ACTUALS': y, 'PREDICTIONS': lr_predictions}).plot(figsize=(20, 10));

The above graph plots the predictions of the regression model against the actual values. The model fit is already reasonably well. A problem however is that the model predicts negative values in times where bookings are low and misses out on low bookings in time of high demand. Also very high demand patterns are not caught by the dynamics of the model.

In [110]:
pd.DataFrame({'Feature': X.columns, 'Coefficient': lr_model.coef_})
Out[110]:
Feature Coefficient
0 SUNSHINE_DURATION 0.055168
1 AIR_TEMP 0.274342
2 HUMIDITY -0.091256
3 PRECIPITATION_HEIGHT -0.584027
4 RAIN_FALLEN -0.950773
5 MEAN_WIND_SPEED -0.204108
6 lag_1 0.538280
7 lag_2 0.153254
8 lag_3 0.007587
9 lag_4 0.038887
10 lag_5 -0.082604
11 FALL 1.617125
12 SPRING -1.324120
13 SUMMER -0.501538
14 WINTER 0.208532
15 HOUR_SIN -2.849058
16 HOUR_COS -1.997334
17 WEEK_OF_YEAR_SIN 0.972228
18 WEEK_OF_YEAR_COS -1.209031
19 DAY_OF_WEEK_SIN 0.026478
20 DAY_OF_WEEK_COS -0.892951

The above table shows the variables and the resulting coefficients. Again the coefficients are as expected. Important to mention is however that the regression model found the negative influence of mean wind speed which can be explained by the positive correlation of mean wind speed with air temperature. After it was accounted for the positive influence of air temperature the influence of mean wind speed is negative as one would expect it to be.

Random Forest Regression

Due to the probable non-linear relationships that were observed during feature examination and the fact that linear regression predicted negative values, random forest regression seems as an appropriate model for this analysis. It does allow for non-linear relationships and cancels out some key problems that are associated with linear regression (e.g. multicollinearity, strong assumptions about the underlying process) as well as it will predict positive values only if not negative values are present. In addition this method has been performing well on time series data in a whole range of cases.

In [111]:
# Initialising the random forest regression model
rf_model = RandomForestRegressor(n_jobs=-1, random_state=123)
In [112]:
# Fitting the model to the whole dataset
rf_model.fit(X, y);
In [113]:
# Calculating the fit of the model
rf_predictions = rf_model.predict(X)

# Calculating the residuals
rf_residuals = y - rf_predictions 
In [114]:
# Creating fit and subplots
fix, axs = plt.subplots(2, 1, figsize=(17, 15), sharex=True)

# Plotting actuals and fit of linear and random forest regression
pd.DataFrame({'ACTUALS': y,
              'RandomForestRegression': rf_predictions, 
              'LinearRegression': lr_predictions}).set_index(one_h_data['DATE_FROM']).plot(ax=axs[0])
axs[0].set_title('Actuals vs. RandomForest vs. LinearRegression')

# Plotting residuals of linear and random forest regression
pd.DataFrame({'LinearRegression': lr_residuals, 
              'RandomForestRegression': rf_residuals}).set_index(one_h_data['DATE_FROM']).plot(ax=axs[1])
axs[1].set_xlabel('Date')
axs[1].set_title('RandomForest Residuals vs. LinearRegression Residuals');

The above graphs compare the predictive power of linear and random forest regression. It can be seen that random forest regression fits the data better in two ways. Firstly the predictions of random forest regression do not contain negative values. Secondly the residuals are lower and their variance is consistent over time. These are strong indicators that the better model for prediction is random forest regression. This however has to be evaluated by cross-validation.

Grid Search of Random Forest Regresson Hyperparameters

The below code was run in order to search for optimal hyperparameters. Deducing from the sklearn documentation the hyperparameters used are the most important in terms of increasing prediction accuracy. The grid search was conducted using random cross-validation to not overfit the model. Since the process took about 35 Hours the best hyperparameter combination is provided as a dictionary result.

def random_grid_cv(data, seed=1, param_distributions=None, n_iter=100, no_of_forecasts=52, horizon=100):
    """
    Wrapper Function that makes use of TimeSeriesCV allowing randomized grid search selecting a random
    sample of hyperparameter combinations.

    :param data: a dataframe containing all features and data
    :param seed: int defining the random state for cutoff and parameter sample selection
    :param param_distributions: a dict containing the parameters and a list of values to try
    :param n_iter: int specifying the number of different combinations to try
    :param no_of_forecasts: int speciying how many cutoffs should been chosen 
    :param horizon: int specifying the number of periods to forecast ahead
    :return: dict containing the hyperparameters and their specific mae
    """

    start_time = time.time()

    # Drawing a random sample of possible parameter combinations
    param_list = list(ParameterSampler(param_distributions, n_iter=n_iter, random_state=seed))

    # Initialising a list to store the arguments for each individual iteration
    exec_list = list()

    # Looping through the sample of parameter combinations and merge them with the function parameters
    # creating a tuple object for each iteration
    for param_variant in param_list:
        exec_list.append((data, param_variant, True, no_of_forecasts, horizon, seed))

    # Running TimeSeriesCV for each tuple in exec_list
    result = list(starmap(TimeSeriesCV, exec_list))

    print('Grid_search took: {} seconds'.format(round(time.time() - start_time)))
    return result



# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=1600, num=15)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 100, num=10)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 3, 4, 5, 6, 7, 8, 9, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 3, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

grid_results = random_grid_cv(data=one_h_data, 
                              param_distributions=random_grid, 
                              n_iter=215, 
                              no_of_forecasts=52)
Result
{5.330927838827839: {'n_estimators': 200, 
                     'min_samples_split': 9, 
                     'min_samples_leaf': 1, 
                     'max_features': 'sqrt',
                     'max_depth': 70, 
                     'bootstrap': False}}
In [115]:
# Initialising a random forest regression model with optimized hyperparameters
optimized_rf_model = RandomForestRegressor(n_estimators=200, 
                                           min_samples_split=9,
                                           min_samples_leaf=1,
                                           max_features='sqrt',
                                           max_depth=70,
                                           bootstrap=False,
                                           n_jobs=-1,
                                           random_state=123)
In [116]:
# Fitting the model to the whole dataset
optimized_rf_model.fit(X,y);
In [117]:
# Calculating the fit of the model
optimized_rf_predictions = optimized_rf_model.predict(X)

# Calculating the residuals of the model
optimized_rf_residuals = y - optimized_rf_predictions 
In [118]:
# Creating fig and subplots
fix, axs = plt.subplots(2, 1, figsize=(17, 15), sharex=True)

# Plotting actuals and fit of optimized and and default random forest regression
pd.DataFrame({'ACTUALS':y,'OPTIMIZED':optimized_rf_predictions,
              'DEFAULT':rf_predictions}).set_index(one_h_data['DATE_FROM']).plot(ax=axs[0])
axs[0].set_title('Actuals vs. RandomForest Optimized vs. RandomForest Default')

# Plotting residuals  of optimized and and default random forest regression
pd.DataFrame({'DEFAULT':rf_residuals,
              'OPTIMIZED':optimized_rf_residuals}).set_index(one_h_data['DATE_FROM']).plot(ax=axs[1])
axs[1].set_xlabel('Date')
axs[1].set_title('RandomForest Optimized Residuals vs. RandomForest Default Residuals');

The above graphs compare the predictive power of the optimized and the default random forest regression. Optimized parameters have increased the performance of the random forest regression even further in terms of fit. If it increased predictive performance however has to be evaluated.

Feature Importance
In [119]:
# Creating fig
fig = plt.figure(figsize=(20, 10))

# Deriving the feature importances of the optimized and default model
default_importances = list(rf_model.feature_importances_)
optimized_importances = list(optimized_rf_model.feature_importances_)

# Creating an array with range of number of variables
x_values = np.arange(len(default_importances))

# Creating bar plots of optimized and default importances
plt.bar(x_values, default_importances,orientation='vertical', color='blue', width=.5, align='center', label='Default')
plt.bar(x_values + 0.5, optimized_importances, orientation='vertical', color='red', width=.5, align='center', label='Optimized')

plt.xticks(x_values + 0.25, list(X.columns), rotation='vertical')

plt.ylabel('Importance')
plt.xlabel('Variable')
plt.title('Variable Importances')

plt.legend(loc=1)
plt.show();

The graph shows that the optimized hyperparameters distribute the feature importance more evenly. Resulting in a more generalized model. Looking at the feature importance it becomes clear that the first lag exerts the highest influence on bike rental demand prediction followed by lag two. Season as a feature as well as precipitation height and rain fallen could be dropped to achieve a higher efficiency in terms of fitting and predictions times this however is not necessary since the fitting times are appropriate.

Evaluation

The code below runs a rolling one-step-ahead cross validation calculating the mean absolute error of a total number of 119 individual forecasts.

The the minimum of training observations are 8759 observations which is the equivalent to a year of past data. Every 74 periods, which is equivalent to about three days, a one-step-ahead forecast is being produced. The number of 74 periods has been chosen so that different predictions will be carried out at different times of the day.

In [120]:
# Setting the cv_method 
cv_method='rolling'

# Setting the initial parameter
initial=8759

# Setting the period parameter
period=74

# setting the horizon parameter
horizon=1
In [121]:
# Performing a linear regression one-step-ahead prediction
lr_rolling_results, lr_rolling_residuals = TimeSeriesCV(one_h_data, 
                                                        func=lr_model, 
                                                        cv_method=cv_method, 
                                                        initial=initial,
                                                        period=period, 
                                                        horizon=horizon)

# Performing a random forest regression with default hyperparameters one-step-ahead prediction
rf_rolling_results, rf_rolling_residuals = TimeSeriesCV(one_h_data, 
                                                        func=rf_model, 
                                                        cv_method=cv_method, 
                                                        initial=initial,
                                                        period=period, 
                                                        horizon=horizon)

# Performing a random forest regression with optimized hyperparameters one-step-ahead prediction
optimized_rf_rolling_results, optimized_rf_rolling_residuals = TimeSeriesCV(one_h_data, 
                                                                    func=optimized_rf_model,
                                                                    cv_method=cv_method,
                                                                    initial=initial, 
                                                                    period=period,
                                                                    horizon=horizon)
Finished RollingCV in with 119 individual forecasts in 1 seconds using method LinearRegression(copy_X=True, fit_intercept=True, n_jobs=-1, normalize=False)
Finished RollingCV in with 119 individual forecasts in 191 seconds using method RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=-1, oob_score=False,
                      random_state=123, verbose=0, warm_start=False)
Finished RollingCV in with 119 individual forecasts in 128 seconds using method RandomForestRegressor(bootstrap=False, ccp_alpha=0.0, criterion='mse',
                      max_depth=70, max_features='sqrt', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=9, min_weight_fraction_leaf=0.0,
                      n_estimators=200, n_jobs=-1, oob_score=False,
                      random_state=123, verbose=0, warm_start=False)
In [122]:
# Creating a dataframe containg actuals and the preidictions of each model
rf_comparison = pd.DataFrame({'PREDICTED_PERIOD': rf_rolling_results['PREDICTED_PERIOD'],
                              'ACTUAL': rf_rolling_results['ACTUAL'],
                              'RandomForestRegression DEFAULT': rf_rolling_results['PREDICTION'],
                              'RandomForestRegression OPTIMIZED': optimized_rf_rolling_results['PREDICTION'],
                              'LinearRegression': lr_rolling_results['PREDICTION']})

# Plotting actuals and preidicitons of each model
rf_comparison.set_index("PREDICTED_PERIOD").plot(figsize=(17,10));
In [123]:
# Printing the mae of the models
print('Linear Regression: {0:.2f}\n'.format(rf_rolling_residuals[0]))

print('Random Forest Regression wit default parameters: {0:.2f}\n'.format(lr_rolling_residuals[0]))

print('Random Forest Regression with optimized parameters: {0:.2f}\n'.format(optimized_rf_rolling_residuals[0]))

print('Improvement due to optmized Random Forest Regression: {0:.2f}%\n'.format((1-(optimized_rf_rolling_residuals[0]/rf_rolling_residuals[0]))*100))
Linear Regression: 5.45

Random Forest Regression wit default parameters: 6.43

Random Forest Regression with optimized parameters: 5.17

Improvement due to optmized Random Forest Regression: 5.09%

In [124]:
# Printing the nMAE of the models
print('Linear Regression: {0:.2f}\n'.format(rf_rolling_residuals[0] / one_h_data['COUNT'].mean()))

print('Random Forest Regression wit default parameters: {0:.2f}\n'.format(lr_rolling_residuals[0] / one_h_data['COUNT'].mean()))

print('Random Forest Regression with optimized parameters: {0:.2f}\n'.format(optimized_rf_rolling_residuals[0] / one_h_data['COUNT'].mean()))
Linear Regression: 0.27

Random Forest Regression wit default parameters: 0.32

Random Forest Regression with optimized parameters: 0.26

The results of rolling one-step-ahead cross validation suggest that random forest regression with optimized hyperparameters yields the best result with a 5% increase compared to prediction performance. Optimizing hyperparameters also decreased the time needed to fit the model which can be seen as an increase in efficiency. The prediction performance however is very close and shows a significant efficiency advantage in terms of time.

Two Hourly Data

The following section deals with two-hourly data. Since the examination as well as the engineering process are equal to hourly data this part will be unannotated. Evaluation, however will be conducted and discussed.

Feature Engineering

In [125]:
# Dropping columns that do not have to seem an effect
two_h_data.drop(columns=['SCHOOL_HOLIDAY', 'BANK_HOLIDAY'], axis=1, inplace=True)
Generating Lags of order 2
In [126]:
# Generating columns that contain lagged values of the targed variable
two_h_data['lag_1'] = two_h_data['COUNT'].shift(periods=1)
two_h_data['lag_2'] = two_h_data['COUNT'].shift(periods=2)

# Dropping missing values as a result of the shifts and reset the index
two_h_data = two_h_data.dropna(axis=0).reset_index()
Seasons
In [127]:
# Creating a SEASON list based on the specific month
seasons = []
for month in two_h_data.DATE_FROM.dt.strftime('%m').astype('int'):
    if month in [1, 2, 12]:
        seasons.append('WINTER')
    elif month in [3]:
        seasons.append('SPRING')
    elif month in [4, 5, 6, 7, 8, 9]:
        seasons.append('SUMMER')
    elif month in [10, 11]:
        seasons.append('FALL')
        
# Merging the seasons as one-hot-encoded binary variables
two_h_data = two_h_data.merge(pd.get_dummies(seasons, drop_first=False),how='left', left_index=True, right_index=True)
Hour of Day
In [128]:
# Creating an individual column for hour of the day
two_h_data['HOUR'] = two_h_data.DATE_FROM.dt.strftime('%-H').astype('int')

# Applying sine,cosine transformation on column hour to retain the cyclical nature
two_h_data['HOUR_SIN'] = np.sin(two_h_data.HOUR * (2. * np.pi / 24))
two_h_data['HOUR_COS'] = np.cos(two_h_data.HOUR * (2. * np.pi / 24))
Week of Year
In [129]:
# Creating an individual column for week of the year
two_h_data['WEEK_OF_YEAR'] = two_h_data.DATE_FROM.dt.strftime('%W').astype('int')

# Applying sine,cosine transformation on column WEEK_OF_YEAR to retain the cyclical nature
two_h_data['WEEK_OF_YEAR_SIN'] = np.sin(two_h_data.WEEK_OF_YEAR * (2. * np.pi / 52))
two_h_data['WEEK_OF_YEAR_COS'] = np.cos(two_h_data.WEEK_OF_YEAR * (2. * np.pi / 52))
Day of Week
In [130]:
# Creating an individual column for day of the week
two_h_data['DAY_OF_WEEK'] = two_h_data.DATE_FROM.dt.strftime('%w').astype('int')

# Applying sine,cosine transformation on column DAY_OF_WEEK to retain the cyclical nature
two_h_data['DAY_OF_WEEK_SIN'] = np.sin(two_h_data.DAY_OF_WEEK * (2. * np.pi / 7))
two_h_data['DAY_OF_WEEK_COS'] = np.cos(two_h_data.DAY_OF_WEEK * (2. * np.pi / 7))
In [131]:
# Dropping individual time columns since their transformation will be used
two_h_data.drop(columns=['WEEK_OF_YEAR', 'DAY_OF_WEEK', 'HOUR'], axis=1, inplace=True)

Evaluation

The code below runs a rolling one-step-ahead cross validation calculating the mean absolute error of a total number of 142 individual forecasts.

The the minimum of training observations are 4379 observations which is the equivalent to a year of past data. Every 31 periods, which is equivalent to about 2,6 days, a one-step-ahead forecast is being produced. The number of 31 periods has been chosen so that different predictions will be carried out at different times of the day.

In [132]:
# Setting the cv_method 
cv_method='rolling'

# Setting the initial parameter
initial=4379

# Setting the period parameter
period=31

# setting the horizon parameter
horizon=1
In [133]:
# Performing a linear regression one-step-ahead prediction
two_h_lr_rolling_results, two_h_lr_rolling_residuals = TimeSeriesCV(two_h_data, 
                                                                    func=lr_model, 
                                                                    cv_method=cv_method, 
                                                                    initial=initial,
                                                                    period=period, 
                                                                    horizon=horizon)

# Performing a random forest regression with default hyperparameters one-step-ahead prediction
two_h_rf_rolling_results, two_h_rf_rolling_residuals = TimeSeriesCV(two_h_data, 
                                                                    func=rf_model, 
                                                                    cv_method=cv_method, 
                                                                    initial=initial,
                                                                    period=period, 
                                                                    horizon=horizon)

# Performing a random forest regression with optimized hyperparameters one-step-ahead prediction
two_h_optimized_rf_rolling_results, two_h_optimized_rf_residuals = TimeSeriesCV(two_h_data, 
                                                                                func=optimized_rf_model,
                                                                                cv_method=cv_method,
                                                                                initial=initial, 
                                                                                period=period,
                                                                                horizon=horizon)
Finished RollingCV in with 142 individual forecasts in 1 seconds using method LinearRegression(copy_X=True, fit_intercept=True, n_jobs=-1, normalize=False)
Finished RollingCV in with 142 individual forecasts in 114 seconds using method RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=-1, oob_score=False,
                      random_state=123, verbose=0, warm_start=False)
Finished RollingCV in with 142 individual forecasts in 93 seconds using method RandomForestRegressor(bootstrap=False, ccp_alpha=0.0, criterion='mse',
                      max_depth=70, max_features='sqrt', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=9, min_weight_fraction_leaf=0.0,
                      n_estimators=200, n_jobs=-1, oob_score=False,
                      random_state=123, verbose=0, warm_start=False)
In [134]:
# Creating a dataframe containg actuals and the preidictions of each model
two_h_rf_comparison = pd.DataFrame({'PREDICTED_PERIOD': two_h_rf_rolling_results['PREDICTED_PERIOD'],
                                    'ACTUAL': two_h_rf_rolling_results['ACTUAL'],
                                    'RandomForestRegression DEFAULT': two_h_rf_rolling_results['PREDICTION'],
                                    'RandomForestRegression OPTIMIZED': two_h_optimized_rf_rolling_results['PREDICTION'],
                                    'LinearRegression': two_h_lr_rolling_results['PREDICTION']
                                    })
# Plotting actuals and preidicitons of each model
two_h_rf_comparison.set_index('PREDICTED_PERIOD').plot(figsize=(17, 10));
In [135]:
# Printing the mae of the models
print('Linear Regression: {0:.2f}\n'.format(two_h_rf_rolling_residuals[0]))

print('Random Forest Regression wit default parameters: {0:.2f}\n'.format(two_h_lr_rolling_residuals[0]))

print('Random Forest Regression with optimized parameters: {0:.2f}\n'.format(two_h_optimized_rf_residuals[0]))
Linear Regression: 8.60

Random Forest Regression wit default parameters: 11.76

Random Forest Regression with optimized parameters: 8.23

In [136]:
# Printing the nMAE of the models
print('Linear Regression: {0:.2f}\n'.format(two_h_rf_rolling_residuals[0] / two_h_data['COUNT'].mean()))

print('Random Forest Regression wit default parameters: {0:.2f}\n'.format(two_h_lr_rolling_residuals[0] / two_h_data['COUNT'].mean()))

print('Random Forest Regression with optimized parameters: {0:.2f}\n'.format(two_h_optimized_rf_residuals[0] / two_h_data['COUNT'].mean()))
Linear Regression: 0.21

Random Forest Regression wit default parameters: 0.29

Random Forest Regression with optimized parameters: 0.21

Again random forest regression with optimized hyper-parameters outperforms the other models. But again only a small increase in predictive performance compared to linear regression is achieved.

Six Hourly Data

Feature Engineering

Generating Lag of Order 4
In [137]:
# Generating columns that contain lagged values of the targed variable
six_h_data['lag_4'] = six_h_data['COUNT'].shift(periods=4)

# Applying sine,cosine transformation on column hour to retain the cyclical nature
six_h_data = six_h_data.dropna(axis=0).reset_index()
Seasons
In [138]:
# Creating a SEASON list based on the specific month
seasons = []
for month in six_h_data.DATE_FROM.dt.strftime('%m').astype('int'):
    if month in [1, 2, 12]:
        seasons.append('WINTER')
    elif month in [3]:
        seasons.append('SPRING')
    elif month in [4, 5, 6, 7, 8, 9]:
        seasons.append('SUMMER')
    elif month in [10, 11]:
        seasons.append('FALL')

# Merging the seasons as one-hot-encoded binary variables
six_h_data = six_h_data.merge(pd.get_dummies(seasons, drop_first=False),how='left', left_index=True, right_index=True)
Week of Year
In [139]:
# Creating an individual column for week of the year
six_h_data['WEEK_OF_YEAR'] = six_h_data.DATE_FROM.dt.strftime('%W').astype('int')

# Applying sine,cosine transformation on column WEEK_OF_YEAR to retain the cyclical nature
six_h_data['WEEK_OF_YEAR_SIN'] = np.sin(six_h_data.WEEK_OF_YEAR * (2. * np.pi / 52))
six_h_data['WEEK_OF_YEAR_COS'] = np.cos(six_h_data.WEEK_OF_YEAR * (2. * np.pi / 52))
Day of Week
In [140]:
# Creating an individual column for dasy of the week
six_h_data['DAY_OF_WEEK'] = six_h_data.DATE_FROM.dt.strftime('%w').astype('int')

# Applying sine,cosine transformation on column DAY_OF_WEEK to retain the cyclical nature
six_h_data['DAY_OF_WEEK_SIN'] = np.sin(six_h_data.DAY_OF_WEEK * (2. * np.pi / 7))
six_h_data['DAY_OF_WEEK_COS'] = np.cos(six_h_data.DAY_OF_WEEK * (2. * np.pi / 7))
In [141]:
# Dropping individual time columns since their transformation will be used
six_h_data.drop(columns=['WEEK_OF_YEAR', 'DAY_OF_WEEK'], axis=1, inplace=True)

Evaluation

In [142]:
# Setting the cv_method 
cv_method='rolling'

# Setting the initial parameter
initial=1457

# Setting the period parameter
period=11

# setting the horizon parameter
horizon=1
In [143]:
# Performing a linear regression one-step-ahead prediction
six_h_lr_rolling_results, six_h_lr_rolling_residuals = TimeSeriesCV(six_h_data, 
                                                                    func=lr_model, 
                                                                    cv_method=cv_method, 
                                                                    initial=initial,
                                                                    period=period, 
                                                                    horizon=horizon)

# Performing a random forest regression with default hyperparameters one-step-ahead prediction
six_h_rf_rolling_results, six_h_rf_rolling_residuals = TimeSeriesCV(six_h_data, 
                                                                    func=rf_model, 
                                                                    cv_method=cv_method, 
                                                                    initial=initial,
                                                                    period=period, 
                                                                    horizon=horizon)

# Performing a random forest regression with optimized hyperparameters one-step-ahead prediction
six_h_optimized_rf_rolling_results, six_h_optimized_rf_residuals = TimeSeriesCV(six_h_data, 
                                                                                func=optimized_rf_model,
                                                                                cv_method=cv_method,
                                                                                initial=initial, 
                                                                                period=period,
                                                                                horizon=horizon)
Finished RollingCV in with 133 individual forecasts in 1 seconds using method LinearRegression(copy_X=True, fit_intercept=True, n_jobs=-1, normalize=False)
Finished RollingCV in with 133 individual forecasts in 52 seconds using method RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=-1, oob_score=False,
                      random_state=123, verbose=0, warm_start=False)
Finished RollingCV in with 133 individual forecasts in 50 seconds using method RandomForestRegressor(bootstrap=False, ccp_alpha=0.0, criterion='mse',
                      max_depth=70, max_features='sqrt', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=9, min_weight_fraction_leaf=0.0,
                      n_estimators=200, n_jobs=-1, oob_score=False,
                      random_state=123, verbose=0, warm_start=False)
In [144]:
# Creating a dataframe containg actuals and the preidictions of each model
six_h_rf_comparison = pd.DataFrame({'PREDICTED_PERIOD': six_h_rf_rolling_results['PREDICTED_PERIOD'],
                                    'ACTUAL': six_h_rf_rolling_results['ACTUAL'],
                                    'RandomForestRegression DEFAULT': six_h_rf_rolling_results['PREDICTION'],
                                    'RandomForestRegression OPTIMIZED': six_h_optimized_rf_rolling_results['PREDICTION'],
                                    'LinearRegression': six_h_lr_rolling_results['PREDICTION']
                                    })
# Plotting actuals and preidicitons of each model
six_h_rf_comparison.set_index('PREDICTED_PERIOD').plot(figsize=(17, 10));
In [145]:
# Printing the mae of the models
print('Linear Regression: {0:.2f}\n'.format(six_h_rf_rolling_residuals[0]))

print('Random Forest Regression wit default parameters: {0:.2f}\n'.format(six_h_lr_rolling_residuals[0]))

print('Random Forest Regression with optimized parameters: {0:.2f}\n'.format(six_h_optimized_rf_residuals[0]))
Linear Regression: 29.58

Random Forest Regression wit default parameters: 32.47

Random Forest Regression with optimized parameters: 27.64

In [146]:
# Printing the nMAE of the models
print('Linear Regression: {0:.2f}\n'.format(six_h_rf_rolling_residuals[0] / six_h_data['COUNT'].mean()))

print('Random Forest Regression wit default parameters: {0:.2f}\n'.format(six_h_lr_rolling_residuals[0] / six_h_data['COUNT'].mean()))

print('Random Forest Regression with optimized parameters: {0:.2f}\n'.format(six_h_optimized_rf_residuals[0] / six_h_data['COUNT'].mean()))
Linear Regression: 0.25

Random Forest Regression wit default parameters: 0.27

Random Forest Regression with optimized parameters: 0.23

Also at a six-hour resolution the same patterns can be observerd

Discussion of Prediction results

This work has been a first attempt to predict demand in the case of bikes-haring in the city of Kassel. For prediction different approaches have been chosen depending on the resolution and the process of the underlying data.

For daily data time series algorithms have been used since the minimum requirement of most time series models is that data is at minimum of daily resolution if not weekly or monthly. The algorithms employed were SARIMAx and Facebook's Prophet. Both of them perform reasonably well in predicting daily bike rental demand with a MAE of around 60 bikes a day. These algorithms however, could be improved further by utilizing more data since this will allow the algorithms to truly learn the process in terms of seasonality (i.e monthly seasonality is based on two observations since there are only two realisations of the same month in the data). The SARIMAX could benefit of a grid search for optimal hyper-parameters concerning the order to further improve prediction results. In addition other hyper-parameters of prophet could be examined. Generally. it can be said that prophet delivers a better performance in terms of prediction and computational resources. It is therefore a more efficient and effective and should be employed as the final model.

For sub-daily data most time series algorithms are not appropriate. Therefore linear regression and random forest regression have been utilized for prediction. Linear regression in this case was intended as a baseline model to beat. Random forest regression was trained and a grid search for optimal hyper-parameters has been conducted leading to a small improvement in fit. Contrary to expectations, the linear regression performed close to the more complicated random forest regression. However it was seen that random forest regression explains the data better which can be seen by the much better fit of the model. Since also the prediction is better by around 5% random forest regression should be chosen as the final model. Further improvements could be generated by requiring a calendar of special festive occasions of the city Kassel which was not available for the time frame of the data.

Finally it can be said that prediction could be carried out in a relatively precise manner. However to fully utilize the results of the predictions tools have to be developed that predicts the demand per station as the aim is to increase service quality by reallocating unused bikes. This however requires more complex models that take the fact into account that bikes can only be rented when bikes are present, therefore current usage patterns might not depict the actual demand.

Comparison of mean absolute error (MAE) of different algorithms at different resolutions

Daily Hourly Two-hourly Six-hourly
SARIMAX 63.69 - - -
Prophet 55.37 - - -
Linear regression - 5.45 8.60 29.58
Random forest regression - 6.43 11.76 32.47
Random forest regression (optimized) - 5.17 8.23 27.64